Foreword


The Software Engineering Laboratory (SEL) is an organization sponsored by the National
Aeronautics and Space Administration/Goddard Space Flight Center (NASA/GSFC) and 
created to investigate the effectiveness of software engineering technologies when applied to 
the development of applications software. The SEL was created in 1976 and has three 
primary organizational members: 

NASA/GSFC, Software Engineering Branch 

University of Maryland, Department of Computer Science 

Computer Sciences Corporation, Software Engineering Operation 

The goals of the SEL are (1) to understand the software development process in the GSFC 
environment; (2) to measure the effect of various methodologies, tools, and models on this 
process; and (3) to identify and then to apply successful development practices. The 
activities, findings, and recommendations of the SEL are recorded in the Software 
Engineering Laboratory Series, a continuing series of reports that includes this document. 

Single copies of this document can be obtained by writing to 

Software Engineering Branch 
Code 552 

Goddard Space Flight Center 
Greenbelt, Maryland 20771 
SECTION 1— INTRODUCTION


This document is a collection of selected technical papers produced by participants in the 
Software Engineering Laboratory (SEL) from November 1993 through October 1994. The 
purpose of the document is to make available, in one reference, some results of SEL research 
that originally appeared in a number of different forums. This is the 12th such volume of 
technical papers produced by the SEL. Although these papers cover several topics related to 
software engineering, they do not encompass the entire scope of SEL activities and interests. 
Additional information about the SEL and its research efforts may be obtained from the 
sources listed in the bibliography at the end of this document. 

For the convenience of this presentation, the five papers contained here are grouped into 
three major sections: 

• Software Measurement 

• Technology Evaluations 

• Ada Technology 

The first section (Section 2) includes a study on the analysis of software maintenance 
changes to understand the flaws in the change process and a study on the comparison of four
strategies for defining high-level design metrics. Section 3 presents studies on software 
inspection techniques and the SEL’s Quality Improvement Paradigm. A study on simulating 
inheritance in an object-oriented environment appears in Section 4. 

The SEL is actively working to understand and improve the software development process at 
Goddard Space Flight Center (GSFC). Future efforts will be documented in additional
volumes of the Collected Software Engineering Papers and other SEL publications. 
SECTION 2— SOFTWARE MEASUREMENT


The technical papers included in this section were originally prepared as indicated below. 

• “A Change Analysis Process to Characterize Software Maintenance Projects,” 
L. C. Briand, V. R. Basili, Y. Kim, and D. R. Squier, Proceedings of the International
Conference on Software Maintenance, September 1994

• “Defining and Validating High-Level Design Metrics,” L. Briand, S. Morasca, and
V. R. Basili, University of Maryland, Technical Report TR-3301, June 1994 
A Change Analysis Process to Characterize Software Maintenance Projects


Lionel C. Briand, Victor R. Basili, Yong-Mi Kim 
Computer Science Department and Institute for Advanced Computer Studies 
University of Maryland, College Park, MD, 20742 


Donald R. Squier
Computer Sciences Corporation 
System Sciences Division 
Lanham-Seabrook, MD, 20706 


Abstract 

In order to improve software maintenance processes, we
need to be able to first characterize and assess them.
This task needs to be performed in depth and with
objectivity since the problems are complex. One
approach is to set up a measurement program
specifically aimed at maintenance. However,
establishing a measurement program requires that one
understands the issues and is able to characterize the
maintenance environment and processes in order to
collect suitable and cost-effective data. Also, enacting
such a program and getting usable data sets takes time.
A short term substitute is needed.

We propose in this paper a characterization process
aimed specifically at maintenance and based on a
general qualitative analysis methodology. This process
is rigorously defined in order to be repeatable and usable
by people who are not acquainted with such analysis
procedures. A basic feature of our approach is that
maintenance changes are analyzed in order to understand
the flaws in the change process. Guidelines are provided
and a case study is shown that demonstrates the
usefulness of the approach.

1 Introduction 

As described in [HV92], numerous factors can affect 
software maintenance quality and productivity, e.g., the 
maintenance personnel experience profile and training, 
the way knowledge about the maintained systems is 
managed and conveyed to the maintainers and users, the 
maintenance organization, processes and standards in 
use, the initial quality of the software source code and 
its documentation. This last factor involves concepts 
such as self-descriptiveness, modularity, simplicity, 
consistency, expandability, and testability. 

Because of the complexity of the phenomena studied, it
is difficult for maintenance organizations to identify and
assess the issues they have to address in order to
improve the quality and productivity of their
maintenance projects. Each project may encounter
specific difficulties and situations that are not
necessarily alike across all the organization's
maintenance projects. This may be due in part to
variations in application domain, size, change
frequency, and/or schedule/budget constraints. As a
consequence, each project has first to be analyzed as a
separate entity even if, later on, commonalities across
projects may require similar solutions for
improvement. Informally interviewing the people
involved in the maintenance process would be unlikely
to help determine accurately the real issues.
Maintainers, users and owners would likely each give
very different, and often contradictory, insights on the
issues due to a somewhat incomplete and biased
perspective.

This work was supported in part by NASA grant NSG-5123.

Establishing a measurement program integrated into the 
maintenance process is likely to help any organization 
achieve an in-depth understanding of its specific 
maintenance issues and thereby lay a solid foundation 
for maintenance process improvement [RUV92]. 
However, defining and enacting a measurement program 
may take time and a short term, quickly operational 
substitute is needed in order to obtain a first quick 
insight, at low cost, into the issues to be addressed. 
Furthermore, defining efficient and useful measurement
procedures first requires a characterization of the 
maintenance environment in which measurement takes 
place, i.e., organization structures, processes, issues, 
risks, etc. [BR88]. 

This paper presents a qualitative and inductive analysis
methodology for performing objective project
characterizations and thereby identifying their specific
problems and needs. It is an implementation of the
general qualitative analysis methodology defined in
[SS92]. It encompasses a set of procedures which
allows the determination of causal links between
maintenance problems and flaws of the maintenance
organization and process. Thus, a set of concrete steps
for maintenance quality and productivity improvement
can be taken based on a tangible understanding of the
relevant maintenance issues. Moreover, this
understanding provides a solid basis on which to define
relevant software maintenance models and metrics.
Section 2 describes the phases, techniques and
guidelines composing the methodology. Section 3
presents a case study of an orbit determination system
maintained by the Flight Dynamics Division (FDD) of
the NASA Goddard Space Flight Center for the last 26
years and still used daily for most operating satellites
(GTDS: Goddard Trajectory Determination System).
This study takes place in the framework of the NASA
Software Engineering Laboratory (NASA-SEL), an
organization aimed at improving FDD software
development processes based on measurement and
empirical analysis. Recently, responding to the
growing cost of software maintenance, the SEL has
initiated a program aimed at characterizing, evaluating
and improving its maintenance processes. This paper is
a first step in this direction. Section 4 outlines the
main conclusions of the case study and the future
research directions.

Figure 1: Qualitative Analysis Process for Software Maintenance

2 Causal Analysis of Maintenance Problems

In this section, we present a (mainly) qualitative 
methodology that allows for an in-depth 
characterization of maintenance projects at a relatively 
low cost. However, this approach could be easily 
augmented to integrate data collection and analysis and 
could thus provide more quantitative information (but 
at a higher cost). 

2.1 A Qualitative Analysis Process

This characterization process is essentially an
instantiation of the generic qualitative analysis process
defined in [SS92]. Figure 1 illustrates at a high level
our maintenance specific analysis process. It can be
seen that it is a combination of both inductive and
deductive inferences. Inductive inferences are based on
the collected information, and deductive inferences occur
when experimentally validating and refining our
taxonomies, process models, organizational models and
working hypotheses. These deductive inferences then
serve to refine the data collection process, which leads
to refined and revised inductive inferences. The process
continues in an iterative fashion.

We present below a general description of the process 
involved in preparing and performing characterizations 
of maintenance projects. Maintenance is defined here as 
any kind of enhancement, adaptation or correction 
performed on the software system once in operation. At 
the highest level of abstraction, parts of this process do 
not appear specific to maintenance and could also be 
used for development. However, the taxonomies and
guidelines developed to support this process and 
presented in Section 2.2 are specifically aimed at 
maintenance. 

Step 1 focuses on defining the organizational 
structures, i.e., organization entities, their 
communication channels and information flows. The 
process of producing a new release is then described and
modeled in Step 2. It is important to note that we do
not address here the issues related to emergency bug
fixing procedures but only those relevant to regular
product releases that go into configuration
management. Step 3 maps generic activities into the
release process in order to specify the type of work
performed at each stage of the process. Then, a release
(or several) has to be selected in order to define the set
of changes on which the analysis will be performed
(Step 4). In Step 5, relying on the work performed in
Steps 1-3, information about the changes is collected
and analyzed. Step 6 summarizes and abstracts from the
results obtained in Step 5.
Although the steps are defined sequentially, they are
really iterated within and across steps. As we learn
more about the organization, we continue to refine the
characterization models. The organizational and process
models produced should include enough detail to allow
Step 5 to be performed, but should not be so detailed as
to obscure the maintenance process itself. We now define
the steps in more detail:

1 Identify the organizational entities with which 
the maintenance team interacts and the organizational 
structure in which maintainers operate.

1.1 Identify distinct organizational entities, i.e., 
what are the distinct teams involved in the maintenance 
project? Usually, besides the maintainers themselves, 
the following entities are encountered: users, owners, 
QA team, configuration control team, change control 
board. However, their roles and prerogatives can differ 
significantly. 

1.2 Characterize the working environment of each 
entity, i.e., support tools (see tool taxonomy in 
Section 2.2), internal organizational structure.

1.3 Characterize information flows between 
entities, i.e., what is the type (and amount when data 
available) of information, documentation, source code 
and other software artifacts flowing between 
organizational entities? 

2 Identify the phases involved in the creation of 
a new system release. 

2.1 Identify the phases as defined in the 
environment studied. At this stage, it is important not 
to map an a priori external/generic maintenance process
model and vocabulary. 

2.2 Each artifact (e.g., document, source code) 
which is input or output of each phase has to be 
determined and its content carefully described (see 
document taxonomy in Section 2.2). 

2.3 The personnel in charge of producing and 
validating the output artifacts of each phase have to be 
identified and located in the organizational structure 
defined in Step 1. 

3 Identify the generic activities involved in each 
phase. 

3.1 Select (from the literature [CSS, BC91]) or
define a taxonomy of generic activities based on widely 
accepted definitions and used in the maintenance 
process. As a guideline, such a taxonomy is proposed 
in the next section. 

3.2 Map these activities into each phase by 
reading the technical documents produced and 
interviewing the technical project leaders and 
maintainers about their real work habits. If possible,
collect effort data for each activity so that the
importance of each activity in each phase can be
assessed somewhat quantitatively.

4 Select one or several past releases for analysis. 

We need to select releases on which we can analyze 
problems as they are occurring and thereby better
understand process and organization flaws. However, 
because of time constraints, it is sometimes more 
practical to work on past releases. We present below a 
set of guidelines for selecting them: 

• Recent releases are preferable since 
maintenance processes and organizational structure 
might have changed and this would make one's analysis 
somewhat irrelevant. 

• Some releases may contain more complete 
documentation than others. Documentation has a very 
important role in detecting problems and cross-checking 
the information provided by the maintainers. 

• The technical leader(s) of a release may have 
left the company whereas another release's technical 
leader may still be contacted. This is a crucial element 
since, as we will see, the causal analysis process will 
involve project technical leader(s) and, depending on 
his/her/their level of control and knowledge, possibly 
the maintainers themselves. 

5 Analysis of the problems that occurred while 
performing the changes of the selected releases. 

For each change (i.e., error correction, 
enhancement, adaptation) in the selected release(s), the 
following information should be acquired by 
interviewing the maintainers and/or technical leaders 
and by reading the related documentation (e.g., release 
documents):

I1. Determine the difficulty or error-proneness of the
change.

I2. Determine whether the change difficulty could have
been alleviated or the error(s) resulting from the change
avoided, and how.

I3. Evaluate the size of the change (e.g., # components,
LOCs changed, added, removed).

I4. Assess discrepancies between initial & intermediate
planning and actual effort / time.

I5. Determine the human flaw(s) (if any) that originated
the error(s) or increased the difficulty related to the
change. A taxonomy of human errors is proposed in
Section 2.2.

I6. Determine the maintenance process flaws that led to
the identified human errors (if any). A taxonomy of
maintenance process flaws is proposed in Section 2.2.

I7. Try to quantify the wasted effort and/or delay
generated by the maintenance process flaws (if any).
The knowledge and understanding acquired
through Steps 1-3 is necessary in order to understand,
interpret and formalize the information of type I2, I5 or
I6. As guidance in conducting interviews, templates
of questions are provided in Section 2.2.
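The information items collected for each change in Step 5 can be kept in a simple record, one per change. The following is a minimal sketch in Python, not part of the original methodology: the class name, field names, units (hours, LOC), and the example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChangeRecord:
    """One analyzed change; fields mirror the seven Step 5 information items."""
    change_id: str                      # hypothetical identifier
    difficulty: str                     # I1: difficulty or error-proneness
    avoidable_how: Optional[str]        # I2: how difficulty/errors could have been avoided
    components_changed: int             # I3: size, in components
    loc_changed: int                    # I3: size, in LOC changed, added, removed
    planned_effort_hours: float         # I4: planned effort
    actual_effort_hours: float          # I4: actual effort
    human_errors: List[str] = field(default_factory=list)    # I5
    process_flaws: List[str] = field(default_factory=list)   # I6
    wasted_effort_hours: float = 0.0    # I7

    def effort_discrepancy(self) -> float:
        """Actual minus planned effort (positive means an overrun)."""
        return self.actual_effort_hours - self.planned_effort_hours


# Illustrative values only; they do not come from the case study.
example = ChangeRecord(
    change_id="CR-001",
    difficulty="high",
    avoidable_how="document module interfaces",
    components_changed=3,
    loc_changed=120,
    planned_effort_hours=40.0,
    actual_effort_hours=55.0,
    human_errors=["lack of system design knowledge"],
    process_flaws=["poor quality system documentation"],
    wasted_effort_hours=10.0,
)
print(example.effort_discrepancy())
```

Keeping the human errors and process flaws as separate lists preserves the distinction between the two taxonomies, so the causal link between them can be examined change by change.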

6 Establish the frequency and consequences of
problems due to flaws in the organizational structure
and the maintenance process by analyzing the
information gathered in Step 5.

Based on these results, further complementary 
investigations (e.g., measurement based) related to 
specific issues that have not been fully resolved by the 
qualitative analysis process, should be identified. 
Moreover, a first set of suggestions for maintenance 
process improvement should be devised.
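As a rough illustration of Step 6, the frequency of each process flaw and the effort wasted because of it can be tallied from the Step 5 results. The sketch below uses hypothetical data, and the rule of splitting a change's wasted effort evenly across its flaws is an assumption made here for simplicity, not something the methodology prescribes.

```python
from collections import Counter, defaultdict

# Hypothetical Step 5 results: (change id, process flaws identified, wasted hours).
changes = [
    ("CR-001", ["poor quality system documentation"], 10.0),
    ("CR-002", ["incomplete change requirements",
                "poor quality system documentation"], 6.0),
    ("CR-003", [], 0.0),
]


def summarize_flaws(records):
    """Return (frequency per flaw, wasted hours per flaw).

    Assumption: wasted effort of a change is split evenly across its flaws.
    """
    frequency = Counter()
    wasted = defaultdict(float)
    for _change_id, flaws, wasted_hours in records:
        for flaw in flaws:
            frequency[flaw] += 1
            wasted[flaw] += wasted_hours / len(flaws)
    return frequency, dict(wasted)


frequency, wasted = summarize_flaws(changes)
for flaw, count in frequency.most_common():
    print(f"{flaw}: {count} change(s), {wasted[flaw]:.1f} h wasted")
```

Flaws that are both frequent and costly are natural candidates for the first set of improvement suggestions; rare or cheap ones may only warrant further measurement.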

For those steps which are iterative, we map
the appropriate step back into the qualitative analysis
process (Figure 1). Thereby, we show how our
characterization process fits into the more general
qualitative analysis methodology presented above. In
this context, a step usually corresponds to a set of
iterations of the qualitative analysis process. Thus for
each step we have the input to that step which defines
the Observational Database (ODB), the output of each
step which contains the resulting characterization
models that go into the Interpretative Knowledge Base
(IKB), and a validation procedure which helps verify
that the characterization models are correct. The pieces
of information which compose the ODB are given in
decreasing order of importance at each step. The order
and content of the ODB vary at each step since the
analysis focus is progressively shifting [SS92].

Step 1: Model organizational structures 

Input: maintenance standards definition document,
interviews, sample of release documents, organization
chart

Output: organizational model (roles, agents, teams,
information flow, etc.)

Validation:

• Are all the standard documents and artifacts
included in the modeled information flow?

• Do we know who produces, validates, and
certifies the standard documents and artifacts?

• Are all the people referenced in the release
documents a part of the organization scheme?

Steps 2, 3: Model process and map activities into
process phases

Input: maintenance standards definition document,
interviews, release documents
Output: process model 
Validation: 


• Are all the people in the process model a part
of the organization scheme?

• Do the documents and artifacts included in the
process model match those of the information flow of
the organization model?

• Is the mapping between activities and phases
complete, i.e., exhaustive set of activities, complete
mapping?

• Are the taxonomies of maintenance tools,
methods, and activities adequate, i.e., unambiguous,
disjoint and exhaustive classes?
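Several of the validation questions for Steps 1 to 3 amount to consistency checks between the organizational model and the process model, and can be expressed as set comparisons. The following Python sketch assumes a very simple representation of both models; every role, document, and phase name in it is hypothetical, and only the activity acronyms follow the taxonomy of Section 2.2.

```python
# Hypothetical models; all names are illustrative, not taken from a real project.
organization_roles = {"maintainer", "tester", "owner", "QA"}
information_flow_documents = {"change request", "release plan", "test report"}
activity_taxonomy = {"UND", "IA", "CD", "CC", "UT", "RT", "AT"}

process_model = {
    # phase: (roles involved, documents consumed or produced, generic activities)
    "analysis": ({"maintainer", "owner"}, {"change request", "release plan"}, {"UND", "IA"}),
    "implementation": ({"maintainer"}, {"release plan"}, {"CD", "CC", "UT"}),
    "testing": ({"tester", "auditor"}, {"test report"}, {"AT"}),
}


def validate(process, roles, documents, taxonomy):
    """Return validation findings; an empty set means the check passed."""
    used_roles = set().union(*(r for r, _, _ in process.values()))
    used_docs = set().union(*(d for _, d, _ in process.values()))
    used_acts = set().union(*(a for _, _, a in process.values()))
    return {
        "roles missing from organization model": used_roles - roles,
        "documents missing from information flow": used_docs - documents,
        "flow documents unused by the process": documents - used_docs,
        "activities not mapped to any phase": taxonomy - used_acts,
        "activities outside the taxonomy": used_acts - taxonomy,
    }


findings = validate(process_model, organization_roles,
                    information_flow_documents, activity_taxonomy)
for check, problems in findings.items():
    print(check, "->", sorted(problems) if problems else "ok")
```

In this made-up example the check flags a role ("auditor") that appears in the process model but not in the organization scheme, and an activity (RT) that no phase covers; either finding would send the analyst back to Step 1 or Step 3.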

Step 5: Perform causal analysis

Input: interviews, change request forms, release
documents, organization model, process model,
maintenance standards definition document
Output: causal analysis
Validation:

• Are the taxonomies of errors and
maintenance process flaws adequate, i.e., unambiguous,
disjoint and exhaustive classes? This is checked against
actual change data and validated during interviews with
maintainers.
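Checking a taxonomy against actual change data can be pictured as follows: every observed problem should fall into exactly one class. The sketch below is an assumption-laden simplification in Python, in which each class is reduced to a set of keywords and each observation to a single keyword; in practice the classification is a judgment made during interviews.

```python
# Hypothetical taxonomy: each class is described by a set of keywords.
taxonomy = {
    "organizational": {"communication", "roles", "conformance"},
    "methodological": {"impact analysis", "planning", "regression criteria"},
    "resources": {"tools", "budget"},
}

# Hypothetical observations from interviews, already reduced to one keyword each.
observed = ["planning", "tools", "communication", "turnover"]


def check_taxonomy(classes, observations):
    """Report observations matching no class (taxonomy not exhaustive)
    or more than one class (classes not disjoint)."""
    unclassified, ambiguous = [], []
    for item in observations:
        matches = [name for name, keywords in classes.items() if item in keywords]
        if not matches:
            unclassified.append(item)
        elif len(matches) > 1:
            ambiguous.append((item, matches))
    return unclassified, ambiguous


unclassified, ambiguous = check_taxonomy(taxonomy, observed)
print("unclassified:", unclassified)
print("ambiguous:", ambiguous)
```

An unclassified observation suggests a missing class, while an ambiguous one suggests overlapping class definitions; both are reasons to revise the taxonomy before continuing the analysis.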

2.2 Guidelines and Taxonomies 

This section presents a set of guidelines aimed at 
facilitating the characterization process described in the 
previous section. These guidelines are mainly 
composed of taxonomies distinguishing maintenance 
activities, errors and maintenance process flaws. In
addition, a set of questions which can be used during 
maintainers' interviews and for each change is provided. 

Step 1: Identify organizational entities

Taxonomies of Maintenance Tools and Methods (Step 1.2)

The maintenance tools and methods available to
maintainers can be used to understand the maintenance
process, and identify potential sources of problems. The
following paragraphs represent the first level of
abstraction of environment characteristics' taxonomies
that should be used to characterize the change
framework:

• Maintenance Tools: Impact analysis & 
planning tool ; Tools for automated extraction & 
representation of control and data flows ; Debugger ; 
Cross-referencer ; Regression testing environment (data 
generation, execution, and analysis of results) ; 
Information system linking documentation and code. 

• Maintenance Methods are characterized by the 
following taxonomy: rigorous impact analysis, 
planning, and scheduling procedures ; Systematic and 
disciplined update procedures of the user and system 
documentation ; Communication channels and
procedures with the users ; 

A Taxonomy of Maintenance Documentation (Step 2.2)

The type of documentation related to a software system, 
which may be available to maintainers, can be defined 
by a generic taxonomy as shown below. 
Documentation has been described as one of the most
important factors affecting the maintainability of a
software system [HA93, P94]. This is why it is
important to define precisely what should be contained
in a complete set of documentation (either on-line or
off-line) for maintenance. Such a taxonomy can be used
as a guideline to define the maintenance organization. 
Also, when some of these documents appear to be 
missing, potential sources of maintenance problems 
may be identified. Based on the literature [BC91] on the 
subject and our own experience, we propose the 
following taxonomy: 

• Product-related: Software requirements specifications ; 
Software design specifications ; Software product 
specifications 

• Process-related: Test plans ; Configuration 
management plan ; Quality assurance plan ; Software
development plan 

• Support-related: Software user's manual ; Computer 
systems operator's manual ; Software maintenance 
manual ; Firmware support manual 

Step 3: Identify the generic activities involved in each 
phase. 

Generic Description of Maintenance Activities (Step 
3.1) 


Acronym   Activity

DET       Determination of the need for a change
SUB       Submission of change request
UND       Understanding requirements of changes: localization,
          change design prototype
IA        Impact analysis
CBA       Cost/benefit analysis
AR        Approval/Rejection/priority assignment of change request
SC        Scheduling of task
CD        Change design
CC        Code changes
UT        Unit testing of modified parts,
          i.e., has the change been implemented?
IT        Integration testing,
          i.e., does the changed part interface correctly with the system?
RT        Regression testing,
          i.e., does the change have any unwanted side effects?
AT        Acceptance testing,
          i.e., does the new release fulfill the system requirements?
USD       Update system & user documentation
SA        Standards characterizations; quality assurance procedures
IS        Installation
PIR       Post-installation review of changes
EDU       Education/training regarding the application domain/system



All these activities usually contain an overhead of 
communication (meeting + release document writing) 
with owner/users, management hierarchy and other 
maintainers which should be estimated. This is 
possible, through data collection or by interviewing 
maintainers (e.g., Delphi method). 
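When effort data can be collected per activity, the relative weight of each activity (and of the communication overhead) within a phase follows from a simple aggregation. The Python sketch below is only an illustration: the phase names, the "overhead" label, and all hour figures are assumptions, and only the activity acronyms come from the table above.

```python
from collections import defaultdict

# Hypothetical effort log: (phase, activity acronym or "overhead", hours).
effort_log = [
    ("analysis", "UND", 10.0),
    ("analysis", "IA", 6.0),
    ("analysis", "overhead", 4.0),   # meetings, release document writing
    ("implementation", "CD", 10.0),
    ("implementation", "CC", 20.0),
    ("implementation", "UT", 10.0),
]


def effort_shares(log):
    """Return {phase: {activity: fraction of that phase's total effort}}."""
    totals = defaultdict(float)
    per_activity = defaultdict(lambda: defaultdict(float))
    for phase, activity, hours in log:
        totals[phase] += hours
        per_activity[phase][activity] += hours
    return {
        phase: {act: hours / totals[phase] for act, hours in acts.items()}
        for phase, acts in per_activity.items()
    }


shares = effort_shares(effort_log)
for phase, acts in shares.items():
    print(phase, {act: f"{share:.0%}" for act, share in acts.items()})
```

Where no effort log exists, the same structure can hold estimates obtained from maintainers, for instance through a Delphi exercise.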

Step 5: Perform causal analysis

Questions asked for each change in selected release(s)
(Items I1-I4)

The following list describes a set of questions for 
which answers can be provided by maintainers and/or 
release standard documents. These questions attempt to 
capture the information necessary for the identification
of maintenance process flaws. 

1 - Description of the change

1.1 Localization 

subsystem(s) affected 
module(s) affected 
inputs/outputs affected 

1.2 Size 

LOCs deleted, changed, added 
Modules examined, deleted, changed,
added

1.3 Type of change 

• Preventive changes: improvement of clarity,
maintainability or documentation.

• Enhancement changes: add functionalities,
optimization of space/time/accuracy.

• Adaptive changes: adapt system to change of hardware
and/or platform.

• Corrective changes: corrections of development errors.
2 - Description of the change process

2.1 effort, elapsed time

2.2 maintainer's expertise and experience

How long has the person been working
on the system?

How long has the person been working
in this application domain?

2.3 Did the change generate a change in any document?
Which document(s)?

3 - Description of the problem 

3.1 Were some errors committed? 

Description of the errors (see
taxonomies below) 

Perceived cause of the errors: 
maintenance process flaw(s) 

3.2 Difficulty

What made the change difficult?

What was the most difficult activity 
associated with the change? 

3.3 How much effort was wasted (if any) as a result of 
maintenance process flaws? 

3.4 What could have been done to avoid some of the 
difficulty, errors (if any)? 

Taxonomies of human errors (Item I5)

Note that we are exclusively referring to errors occurring
during the maintenance process, not errors resulting
from the development.

• Error origin: when did the misunderstanding occur?

Change requirements analysis 
Change localization analysis 
Change design analysis 
Coding 

• Error domain: what caused it?

Lack of application domain knowledge: 
operational constraints (user interface, performance), 
mathematical model 

Lack of system design or implementation 
knowledge: data structure or process dependencies, 
performance or memory constraints, module interface 
inconsistency 

Ambiguous or incomplete requirements 
Language misunderstanding <semantic, 
syntax> 

Schedule pressure 
Existing uncovered fault 
Oversight


Determining the origin and cause of the errors will help 
determine their possible causal relationships to 
maintenance process flaws in the taxonomy presented 
below. 

Taxonomy of Maintenance Process flaws (Item I6)

• Organizational flaws: 

communication: Interface problems,
information flow "bottlenecks" in the communication
between the maintainers and the
    users
    management hierarchy
    quality assurance (QA) team
    configuration management team

roles:
    prerogatives and responsibilities are not fully
    defined or explicit
    incompatible responsibilities, e.g.,
    development and QA

process conformance: no effective structure for
enforcing standards and processes

• Maintenance methodological flaws

Inadequate change selection & priority
assignment process

Inaccurate methodology for planning of effort,
schedule, personnel

Inaccurate methodology for impact analysis
Incomplete, ambiguous protocols for transfer, 
preservation and maintenance of system 
knowledge 

Incomplete, ambiguous definitions of 
change requirements 

Lack of rigor in configuration (versions, 
variations) management and control 
Undefined / unclear regression testing 
success criteria. 

• Resource shortages 

Lack of financial resources allocated, e.g.,
necessary for preventive maintenance,
unexpected problems unforeseen during
impact analysis.

Lack of tools providing technical support
(see previous tool taxonomy)

Lack of tools providing management 
support (i.e., impact analysis, planning) 

• Low quality product(s) 

Loosely defined system requirements
Poor quality design, code of maintained
system

Poor quality system documentation
Poor quality user documentation
• Personnel-related issues

Lack of experience and/or training with
respect to the application domain
Lack of experience and/or training with
respect to the system requirements
(hardware, performance) and design
Lack of experience and/or training with
respect to the users' operational needs and
constraints

In order to demonstrate the feasibility and usefulness of 
the above approach, we present the following case 
study. 

3 A Case Study 

This case study is intended to provide actual examples
and results of the change causal analysis process
described in previous sections. We first present the
maintained system used as a case study. Then, the
specific maintenance organization and process are
described in detail according to the template provided in
Section 2.1. Examples of change causal analyses are 
shown and the lessons learned resulting from this 
analysis process are presented. 

3.1 System History and Description 

GTDS is a 26 year old, 250 KLOC, FORTRAN orbit 
determination system. It is public domain software and, 
as a consequence, has a very large group of users all 
over the world. Usually, 1 or 2 releases are produced 
every year in addition to mission specific versions that
do not go into configuration management right away 
(but are integrated later on to a new version by going 
through the standard release process). Like most 
maintained software systems, very few of the original 
developers are still present in the organization, but the 
turnover of the maintenance team is low compared to 
other maintenance organizations. However, turnover 
still remains a crucial issue in this environment.

3.2 Modeling of the Maintenance 
Organization and Processes 

During the process of building a new release of GTDS, 
different organizational entities interact in different 
ways. By performing Step 1 of the characterization
process described in Section 2.1, two types of entities 
and five types of interactions (i.e., differentiated 
according to the purpose of the information flow) were 
identified. 

The entities, teams and groups, are represented 
in Figure 2 by boxes and ellipses, respectively. Teams 


are persistent organizational structures; groups are 
composed of members of several different teams, and 
are dynamic entities in the sense that they only exist 
when group members meet. These groups have been 
designed to facilitate communication between teams and 
decision making. 

In the five interaction types identified,
information was used for the following purposes:
decision - decision based on information provided;
review - review of documents; approval - approval of
documents or plans; transformation - supplied
information product is transformed into another
information product; and information - dissemination of
information.
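The five interaction purposes can serve as labels on the information flows of an organizational model, which makes it easy to count flows by purpose or to spot entities that receive many flows. The Python sketch below is illustrative only: the entity names echo the teams of this case study, but the individual flows are assumptions and do not reproduce the edges of Figure 2.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    DECISION = "decision"
    REVIEW = "review"
    APPROVAL = "approval"
    TRANSFORMATION = "transformation"
    INFORMATION = "information"


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    purpose: Purpose


# Hypothetical flows; the actual edges of Figure 2 are not reproduced here.
flows = [
    Flow("testers", "maintainers", Purpose.INFORMATION),
    Flow("maintainers", "configuration management", Purpose.TRANSFORMATION),
    Flow("maintainers", "product assurance", Purpose.REVIEW),
    Flow("maintenance management", "maintainers", Purpose.APPROVAL),
]

by_purpose = Counter(flow.purpose for flow in flows)
inbound = Counter(flow.target for flow in flows)
print({purpose.value: n for purpose, n in by_purpose.items()})
print(inbound.most_common(1))
```

An entity with many inbound or outbound flows is a candidate communication bottleneck, which is one of the organizational flaws listed in the taxonomy of Section 2.2.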

Teams:

Testers: they present acceptance test plans,
perform acceptance test and provide change requests to
the maintainers when necessary.

Owners / Users: they suggest, control and 
approve performed changes. 

Product Assurance Organization (PAO): They 
control maintainers' work, e.g., conformance to 
standards, attend release meetings, audit delivery 
packages. They have a different management from the 
maintenance team. 

Configuration Management (FDCM): They 
integrate updates into the system. Coordinate the 
production and release of versions of the system. 
Provide tracking of change requests. 

Maintenance management: They grant 
preliminary approvals of maintenance change requests
and release definitions. 

Maintainers: They analyze changes, make 
recommendations, perform changes, perform unit and 
change validation testing after linking the modified 
units to the existing system, perform validation and 
regression testing after they get back the recompiled 
system from the FDCM team. 

Groups: 

Software Management Planning Board 
(SMPB): Their main goal is to address management 
issues that run across maintenance projects. For 
example, they help resolve conflicts between owners 
and maintainers and review release planning documents. 
Also, they allow task leaders and higher level managers 
to exchange relevant information about the evolution of 
their respective systems. However, SMPB has no 
official function. The board is composed of the task 
leader, section manager, department manager, and 
operation manager.

[Figure 2 legend. Information flow purpose: I1 approval; I2 information; I3 review; I4 information; I5 review; I6 information; I7 [illegible]; I8 decision; I9-I12 transformation; I13 review; I14 transformation; I16 information; I17 decision; I18 information.]

Figure 2: Information Flow within the Maintenance Organization 


Configuration Control Board and 
Configuration Management Office (CCB/CMO): They 
are officially responsible for all changes to configured 
software and the allocated budget. Their goal is to
ensure that the production of new releases is consistent
with the long-term goals of the organization. It is
composed of high-level managers.

GTDS user's group: It is a forum for 
discussion of technical issues but has no official 
function. It is composed of users, maintainers, and 
testers. 

The process described below represents our
understanding of the working process for a release of
GTDS and the mapping into standard generic activities.
This combines the information gained from Steps 2 and
3 of the characterization process. Phases, their
associated inputs/outputs, and activities are presented
below. Activity acronyms are used as defined in Section
2.2. In this case, each phase milestone in a release is
represented by the discussion, approval, and distribution
of a specific release document.

1. Change analysis 

Input: change requests from software owner + priority 
list 

Output: Release Content Review (RCR) document 
which contains change design analysis, prototyping, 
and cost/benefit analysis that may result in a priority 
change to be discussed with the software owner/user. 
Activities: UNDR, IA, CBA, CD, some CC, UT and
IT (for prototyping) 

2. RCR meeting 

Input: Release Content Review document proposed by 
maintainers is discussed, i.e., change priority, content 
of release. 

Output: Updated Release Content Review document 
Activities: AR, SA (QA engineers are reviewing the 
release documents and attending the meeting)

3. Solution analysis

Input: Updated Release Content Review document
Output: Release Design Review (RDR) documentation,
containing the technical solutions devised from the
prototyping analysis performed in phase 1

Activities: SC, CD, CC, UT, (preparation of test 
strategy for) IT (based mainly on equivalence 
partitioning) 

4. RDR meeting
Input: RDR documentation 

Output: approved (and possibly modified) RDR 
documentation 

Activities: review and discuss CC, UT, (plan for) IT, 
SA 

5. Change implementation and test

Input: RDR + prototype solutions (phases 1, 3)

Output: changes are completed; change validation test
is performed with new compiled components linked to
unchanged components of the current system version;
regression testing is performed on the system
recompiled from scratch (provided by the FDCM team);
a report demonstrating that the
system is ready for acceptance test is produced:
Acceptance Test Readiness Review document (ATRR)
Activities: IT, RT, USD

6. ATRR meeting 

Input: Acceptance test readiness review document 
Output: The changes are discussed and validated and the 
used testing strategy is discussed. The acceptance test 
team presents its acceptance testing plan. 

Activities: review the current output of IT, SA

7. Acceptance test

Input: the new GTDS release and all release 
documentation 

Outputs: A list of Software Change requests (SCRs) is 
provided to the maintainers. These changes correspond 
to inconsistencies between the new release and the 
general system requirements. 

Activities: AT 

Steps 1, 2, and 3 required several iterations before there
was sufficient validation of the resulting
characterization of the organization, phases, and
documents. As part of Step 2, for each of the standard
documents generated during the releases of GTDS
studied, we determined who produces it, who approves
it, and what additional relevant information and data
it contains. When doing so, we had to look for
possible inconsistencies between the organization
model (Step 1) and the identified producers/approvers of
the documents.


• Document 1: Release Content Review (RCR):
Producer: maintenance team

Approvers: users, maintenance management, CCB
Content:

. change requirement description
. description of error (if any) that originated the change
. design of a prototype solution
. schedule, effort plans
. impact analysis assessment
• Document 2: Release Design Review (RDR): 
Producer: maintenance team 

Approvers: users, CCB 
Content: 

. identification of modified units
. a definitive solution is proposed 
. rough cost/schedule estimates 
. testing guidelines: mainly equivalence partitioning 
classes 

. definition of the test success criterion 

• Document 3: Acceptance Test Readiness Review 
(ATRR): 

Producers: maintenance team, acceptance test team (test 
plan) 

Approvers: CCB, testers 
Content: 

. results of test cases and benchmarks (regression 
testing) 

. screen printouts, short reports 
. Acceptance test plan 

• Document 4: Delivery package: 

Producer: maintenance team 
Approvers: CCB 

Content:

. cause of error (if any)
. effort breakdown: analysis, design, code, test
. # components examined, modified, added, deleted
. # LOCs modified, added, deleted
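The check for inconsistencies between the organization model (Step 1) and the document producers/approvers (Step 2) can be sketched in a few lines of Python. This is only an illustration: the entity and document names come from the lists above, while the dictionary layout, the use of "maintenance team" as the name for the maintainers, and the function `unknown_roles` are assumptions made for the example.

```python
# Entities of the organization model (Step 1). The naming is an assumption:
# "maintenance team" stands for the maintainers, "CCB" for CCB/CMO.
ORGANIZATION = {
    "testers", "users", "PAO", "FDCM", "maintenance management",
    "maintenance team", "SMPB", "CCB", "GTDS user's group",
}

# Producers and approvers of each release document (Step 2), as listed above.
DOCUMENTS = {
    "RCR": {"producers": {"maintenance team"},
            "approvers": {"users", "maintenance management", "CCB"}},
    "RDR": {"producers": {"maintenance team"},
            "approvers": {"users", "CCB"}},
    "ATRR": {"producers": {"maintenance team", "acceptance test team"},
             "approvers": {"CCB", "testers"}},
    "Delivery package": {"producers": {"maintenance team"},
                         "approvers": {"CCB"}},
}


def unknown_roles(documents, organization):
    """Map each document to the roles it names that the model does not contain."""
    issues = {}
    for name, roles in documents.items():
        missing = (roles["producers"] | roles["approvers"]) - organization
        if missing:
            issues[name] = missing
    return issues


inconsistencies = unknown_roles(DOCUMENTS, ORGANIZATION)
print(inconsistencies)
```

With these inputs the only flagged role is the "acceptance test team" of the ATRR, which the organization model lists under the name "testers". A naming mismatch of this kind would have to be resolved with the people involved rather than by the script.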

As specified in Step 4 of our process, we selected a 
release for analysis. This release was quite recent, most 
of the documentation identified in Step 2 was available, 
and most importantly, the technical leader of the release 
was available for additional insights and information.

Step 5 involved a causal analysis of the problems 
observed during maintenance and acceptance test of the 
releases studied. These problems were linked back to a 
precise set of issues belonging to taxonomies presented 
in Section 2.2. Figure 3 summarizes Step 5 as 
instantiated for this case study.

[Figure 3 labels: Inputs; Outputs; Causal link]

Figure 3: Causal Analysis in GTDS 


In order to illustrate Step 5, we provide below 
an example of causal analysis for one of the changes in 
the selected release. Implementation of this change 
resulted in 11 errors that were found by the acceptance 
test team, 8 of which had to be corrected before final 
delivery could be made. In addition, a substantial 
amount of rework was necessary. Typically, changes do 
not generate so many subsequent errors, but the flaws 
that were present in this change are representative of
maintenance problems in GTDS. In the following 
paragraphs, we discuss only two of the errors generated 
by the change studied. 

• Increased difficulty related to change (rework) 

Description: Initially, users requested an 
enhancement to existing GTDS capabilities (change 
642). The enhancement involved vector computations 
performed over a given timespan. This enhancement 
was considered quite significant by the maintainers, but 
users failed to supply adequate requirements and did not 
attend the RCR meeting. Users did not report their 
dissatisfaction with the design until ATRR meeting 
time, at which time requirements were rewritten and 
maintainers had to perform rework on their 
implementation. This change took a total of 3 months 
to implement, of which at least 1 month was attributed 
to several flaws in the process. 

Maintenance process flaw(s): organizational: a
lack of clear definitions of the prerogatives/duties of
users with respect to release document reviews and
meetings (roles), and a lack of enforcement of the
release procedure (process conformance); maintenance
methodological flaw: incomplete, ambiguous
definitions of change requirements.


• Errors caused by change 642
The implementation of the change itself resulted in an 
error (A1044) found at the acceptance test phase. When 
the correction to A1044 was tested, an error (A1062) 
was found that could be traced back to both 642 and 
A1044. 

A1044 

. Description: Vector computations at the endpoints of
the timespan were not handled correctly. But in the
requirements it was not clear whether the endpoints 
should be considered when implementing the solution. 

. Error origin: change requirement analysis 

. Error domain: ambiguous and incomplete 

requirements 

. Maintenance process flaw(s): organizational: 
communication between users and maintainers, due in 
part to a lack of defined standards for writing change 
requirements; maintenance methodological flaw: 
incomplete, ambiguous definitions of change
requirements. 

A1062 

. Description: One of the system modules in which the 
enhancement change was implemented has two 
processing modes for data. These two modes are listed 
in the user manual. When run in one of the two 
possible processing modes, the enhancement generated 
a set of errors, which were put under the heading 
A1062. At the phase these errors were found, the 
enhancement had already successfully passed the tests 
for the other processing mode. The maintainer should 
have designed a solution to handle both modes 
correctly. 

. Error origin: change design analysis. 

. Error domain: lack of application domain knowledge.

. Maintenance process flaw(s): personnel-related: lack of
experience and/or training with respect to the
application domain.
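The causal analysis records above lend themselves to a simple tabulation, which is roughly what the overall analysis in Step 6 needs as input. The sketch below is an assumption about how such records could be stored; the class `ErrorRecord` and the function `flaw_category_counts` are hypothetical names, and only the two errors discussed above are encoded.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    """One causal analysis entry (the field names are illustrative)."""
    error_id: str
    origin: str
    domain: str
    flaws: tuple  # (flaw category, description) pairs


RECORDS = [
    ErrorRecord(
        "A1044",
        "change requirement analysis",
        "ambiguous and incomplete requirements",
        (("organizational", "communication between users and maintainers"),
         ("methodological", "incomplete, ambiguous change requirements")),
    ),
    ErrorRecord(
        "A1062",
        "change design analysis",
        "lack of application domain knowledge",
        (("personnel-related", "lack of experience/training in the domain"),),
    ),
]


def flaw_category_counts(records):
    """Count how often each flaw category is cited across all records."""
    return Counter(category for r in records for category, _ in r.flaws)


counts = flaw_category_counts(RECORDS)
print(counts)
```

Counting by category over all analyzed changes gives a first indication of which class of maintenance flaw recurs most often in a release.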

The next section presents in detail the results of 
performing Step 6. 

3.3 Lessons Learned about the Studied
Maintenance Project

The lessons learned are classified according to the
taxonomy of maintenance flaws defined in Section 2.2.
By performing an overall analysis of the change causal 
analysis results (Step 6), we abstracted a set of issues 
classified as follows: 

Organization 

• There is a large communication cost overhead 
between maintainers and users, e.g., release standard 
documentation, meetings, management forms. In an 
effort to improve the communication between all the 
participants of the maintenance process, non-technical, 
communication-oriented activities have been 
emphasized. At first glance, this seems to represent
about 40% (rough approximation) of the maintenance
effort. This figure seems excessive, especially when
considering the apparent communication problems
(next paragraph).

• Despite the number of release meetings and 
documents, disagreements and misunderstandings seem
to disturb the maintenance process until late in the 
release cycle. For example, design issues that should be 
settled at the end of the RDR meeting keep emerging 
until acceptance testing is completed. 

As a result, it seems that the administrative 
process and organization scheme should be investigated 
in order to optimize communication and sign-off 
procedures, especially between users and maintainers. 

Process 

• The tools and methodologies used have been 
developed by maintainers themselves and do not belong 
to a standard package provided by the organization. 
Some ad hoc technology transfer seems to take place in 
order to compensate for the lack of a global, commonly 
agreed upon strategy. 

• The task leader has been involved in the 
maintenance of GTDS for a number of years. His 
expertise seems to compensate for the lack of system 
documentation. He is also in charge of the training of 
new personnel (some of the easy changes are used as an 
opportunity for training). Thus, the process relies 
heavily on the expertise of one or two persons. 

• The fact that no historical database of changes 
exists makes some changes very difficult. Maintainers 
very often do not understand the semantics of a piece of
code added in a previous correction. This seems to be
partly due to emergency patching (during a mission)
which was not controlled and cleaned up afterwards (this
has recently been addressed), a high turnover of 
personnel and a lack of written requirements with 
respect to performance, precision and platform 
configuration constraints. 

• For many of the complex changes, 
requirements are often ambiguous and incomplete, from 
a maintainer's perspective. As a consequence, 
requirements are often unstable until very late in the 
release process. While prototyping might be necessary 
for some of them, it is not recognized as such by the 
users and maintainers. Moreover, there is no well 
defined standard for expressing change requirements in a 
way suitable to both maintainers and users. 

Products 

• System documentation (besides the user's 
guide) is not fully maintained and not trusted by 
maintainers. Source code is currently the only reliable 
source of information used by maintainers. 

• GTDS has a large number of users. As a 
consequence, the requirements of this system are 
complex with respect to the hardware configurations on 
which the system must be able to run, the performance
and precision needs, etc. However, no requirement 
analysis document is available and maintained in order 
to help the maintainers devise optimal change 
solutions. 

• Because of budget constraints, there is no 
document reliably defining the hardware and precision 
requirements of the system. Considering the large 
number of users and platforms on which the system 
runs, and the rapid evolution of users' needs, this would 
appear necessary in order to avoid confusion while 
implementing changes.

People 

• There is a lack of understanding of operational 
needs and constraints by maintainers. Release meetings 
were supposed to address such issues but they seem to 
be inadequate in their current form. 

• Users are mainly driven by short term 
objectives which are aimed at satisfying particular 
mission requirements. As a consequence, there is a very 
limited long term strategy and budget for preventive 
maintenance. Moreover, the long term evolution of the 
system is not driven by a well defined strategy and 
maintenance priorities are not clearly identified. 

As a general set of recommendations and based 
on the analysis presented in this paper, we suggest the 
following set of actions: 

• A standard (that may simply contain
guidelines and checklists) should be set up for change
requirements. Both users and maintainers should give
their input with respect to the content of this standard.

• The conformance to the defined release process 
should be improved, e.g., through team building, 
training. In other words, the release documents and 
meetings should more effectively play their specified 
role in the process, e.g., the RDR meeting should 
settle all design disagreements and inconsistencies.

• The parts of the system that are often changed
and highly convoluted (as a result of numerous 
modifications) should be redesigned and documented for 
more productive and reliable maintenance. Technical 
task leaders should be able to point out the sensitive 
system units. 

4 Conclusion 

Characterizing and understanding software maintenance 
processes and organizations are necessary, if effective 
management decisions are to be made and if adequate 
resource allocation is to be provided. Also, in order to
plan and efficiently organize a measurement program,
a necessary step towards process improvement
[BR88], we need to better characterize the
maintenance environment and its related issues. The
difficulty of performing such a characterization stems
from the fact that the people involved in the
maintenance process, who have the necessary
information and knowledge, cannot perform it because
of their inherently biased perspective on the issues.
Therefore, a well defined characterization process, which
is cost-effective, objective, and applicable by outsiders,
needs to be devised.

In this paper, we have presented such an 
empirically refined characterization process which has 
allowed us to gain an in-depth understanding of the 
maintenance issues involved in a particular project, the 
GTDS project. We have been able to gather objective 
information on which we can base management and 
technical decisions about the maintenance process and 
organization. Moreover, this process is general enough 
to be followed in most of the maintenance 
organizations. 

However, such a qualitative analysis is a priori 
limited since it does not allow us to quantify precisely 
the impact of various organizational, technical, and 
process related factors on maintenance cost and quality. 
Thus, the planning of the release is sometimes 
arbitrary, monitoring its progress is extremely difficult, 
and its evaluation remains subjective. 

Hence, there is a need for a data collection 
program for GTDS and across all the maintenance 
projects of our organization. In order to reach such an 
objective, we will base the design of such a 
measurement program on the results provided by this 
study. In addition, we need to model more rigorously 
the maintenance organization and processes so that 
precise evaluation criteria can be defined [SB94].

This approach will be used to analyze several
other maintenance projects in the NASA-SEL in order
to better understand project similarities and differences
in this environment. Thus, we will be able to build
better models of the various classes of maintenance 
projects.

Abstract

The availability of significant metrics in the early phases of the software development
process allows for a better management of the later phases, and a more effective quality
assessment when software quality can still be easily affected by preventive or corrective
actions. In this paper, we introduce and compare four strategies for defining high-level
design metrics. They are based on different sets of assumptions (about the design process)
related to a well defined experimental goal they help reach: identify error-prone software
parts. In particular, we define ratio-scale metrics for cohesion and coupling that show
interesting properties. An in-depth experimental validation, conducted on large scale
projects, demonstrates the usefulness of the metrics we define.


1 Introduction 

Software metrics can help address the most critical issues in software development and 
provide support for planning, predicting, monitoring, controlling, and evaluating the 
quality of both software products and processes [BR88, F91]. Most existing software
metrics attempt to capture characteristics of software code [F91]; however, software code is 
just one of the artifacts produced during software development, and, moreover, it is only 
available at a late stage. It is widely recognized that the production of better software 
requires the improvement of the early development phases and the artifacts they produce: 

^ This work was supported in part by NASA grant NSG-5123, UMIACS, and NSF grant 01-5-24845.
Sandro Morasca was also supported by grants from MURST and CNR.

The production of better specifications and better designs reduces the need for extensive
review, modification, and rewriting not only of code, but of specifications and designs as
well. As a result, this allows the software organization to save time, cut production costs,
and raise the final product's quality.

Early availability of metrics is a key factor to a successful management of software 
development, since it allows for 

• early detection of problems in the artifacts produced in the initial phases of the life-cycle
(specification and design documents) and, therefore, reduction of the cost of
change (late identification and correction of problems are much more costly than
early ones);

• better software quality monitoring from the early phases of the life-cycle;

• quantitative comparison of techniques and empirical refinement of the processes to
which they are applied;

• more accurate planning of resource allocation, based upon the predicted error-proneness
of the system and its constituent parts.

In this paper, we will focus on high-level design metrics for software systems. A number 
of studies have been published on software design metrics in recent years. It has been 
shown that system architecture has an impact on maintainability and error-proneness 
[HK84, G86, R87, R90, S90, SB91, Z91, AE92, BTH93, BBH93]. These studies have 
attempted to capture the design characteristics affecting the ease of maintaining and
debugging a software system. Most of the design metrics are based on information flow
between subroutines or declaration counts. We think that, even though it provides an
interesting insight into the program structure, this should not be the only strategy to be
investigated, since many other types of program features and relationships are a priori
worth studying. Moreover, there is a need for comparison between strategies in order to
identify worthwhile research directions and build accurate prediction models.

Besides this focus on information flow, most of the existing approaches share two
common characteristics. (1) They define metrics without making clear assumptions about
the contexts (i.e., processes, problem domain, environmental factors, etc.) in which they
can be applied (with the exception of [AE92], where this issue was partially addressed).
This implies they should have general validity, and be applicable to different environments
and problem domains. (2) There are not fully explicit goals, for whose achievement the
metrics are defined. This may cause problems in their application, since they may be
defined based on implicit assumptions which the context may not satisfy; interpretation,
since their meaning is not clear; and validation [IS88, K88], since their relevance with
respect to a clearly stated goal is not established.

The definition of universal metrics (like in physical sciences) is an acceptable long-term
goal, which, however, is only achievable after we gain better insights into specific
processes from specific perspectives in the short term. It is our opinion that the definition
of a metric should be driven by both the characteristics of the context or family of contexts
in which it is used, and one or more clearly stated goals that it helps reach. In other words,
the assumptions underlying the defined metrics should rely on a deep knowledge of the
context and should be precisely related to a stated goal. After this, the defined metrics must
undergo a thorough experimental validation, to assess their significance and usefulness
with respect to the stated goals. Last, based on the experimental evidence, metrics may be
refined and modified, to better achieve the goals and comply with the process
characteristics.

The goal of the research documented in this paper is to define and validate a set of
high-level design metrics to evaluate the quality of the high-level design of a software
system with respect to its error-proneness, understand what high-level design
characteristics are likely to make software error-prone, and predict the error-proneness of
the code produced.

We introduce four families of metrics, which are based on different types of
mathematical abstractions of program designs [MGBB90]. In particular, we introduce a
family of metrics based on data declaration dependency links (Section 2.2.4). This strategy
allows us to introduce metrics for cohesion (Section 2.2.4.1) and coupling (Section
2.2.4.2) [F91] that are characterized by interesting properties and are based on consistent
principles. Such a consistency is important because it should facilitate future research on
quantitative tradeoff mechanisms between coupling and cohesion, i.e., variations can be
expressed using consistent measurement units. Other metric families include: metrics based
on declaration counts (Section 2.2.1), metrics based on the USES relationships between
modules [GJM92] (Section 2.2.2), and metrics based on the IS_COMPONENT_OF
relationships [GJM92] (Section 2.2.3).

In addition, we experimentally compare and validate the metrics introduced in 
Section 2 on three NASA projects. The results are shown in Section 3. In Section 4, we 
summarize the lessons we have learned, and outline directions for future research activities 
based on these lessons.

2 Defining Metrics for High-level Design


In this section, we first introduce the basic concepts of high-level design and the 
terminology we will use in the paper (Section 2.1). We then define, based on the goals 
stated in Section 1 and context assumptions, four families of high-level design metrics 
(Section 2.2). 

2.1 Basic Definitions 

Our object of study is the high-level design of a software system. To define it, we will start
from its elementary constituents: software modules.

In the literature, there are two commonly accepted definitions of modules. The first
one sees a module as a routine, either procedural or functional, and has been used in most
of the design measurement publications [M77, CY79, HK84, R87, S90]. The second
definition, which takes an object-oriented perspective, sees a module as a collection of
type, data, and subroutine definitions, i.e., a provider of computational services [B087,
GJM92]. In this view, a module is the implementation of an Abstract Data Type / Object. In
this paper, unless otherwise specified, we will use the term subroutine for the first
category, and reserve the term module for the second category. Modules are composed of
two parts: interface and body (which may be empty). The interface contains the
computational resources that the module makes visible for use to other modules. The body
contains the implementation details that are not to be exported.

At a higher level of abstraction, modules can be seen as the components of higher
level aggregations, as defined below.

Definition 1: Library Module Hierarchy (LMH).

A library module hierarchy is a hierarchy where nodes are modules and subroutines, arcs
between modules are IS_COMPONENT_OF [GJM92] relationships, and there is just one
top level node, which is a module.

0 

In the remainder of this paper, we will define concepts and metrics that can be applied to
both modules and LMHs, which are the most significant syntactic aggregation levels below
the subsystem level. For short, we will use the term software part (sp) to denote either a
module or an LMH.

In the high-level design phase of a software system, only module and subroutine
interfaces and their relationships are defined; module body and subroutine detail design is
carried out at low-level design time. Therefore, we define the high-level design of a
software system as follows.

Definition 2: High-level Design

The high-level design of a software system is a collection of module and subroutine 
interfaces related to each other by means of USES [GJM92] and IS_COMPONENT_OF 
relationships. No body information is yet available at this stage. 

0 


2.2 Strategies to Define High-level Design Metrics 

In this section, we investigate several strategies for defining high-level design metrics. This 
appears necessary at this stage of knowledge, where we can only rely on very limited 
theoretical and empirical ground to help us identify interesting concepts, relationships and 
objects of study. One of the results of this investigation is to provide directions to focus our 
research on a smaller set of strategies and concepts. 

Some of the concepts introduced in this section cannot be directly mapped onto all
imperative languages, because not all of them allow the implementation of Abstract Data
Types/Objects. However, these concepts are shared by many modern programming
languages.

As we said in the Introduction, context assumptions are necessary to define metrics
that are applicable and useful. Therefore, we list a context assumption for each of the
metrics of the four strategies we introduce below. We do not assume that all of these
process assumptions are equally important, i.e., not all of the process characteristics we
take into account have an equal impact on software error-proneness.

2.2.1 Declaration Counts 

These metrics are counts of data declarations, associated with a software part, that are 
imported, exported or declared locally.


Metric 1: Local.

Local(sp) will denote the number of locally defined data declarations of a software part sp.

Assumption A-LD.

The count of declarations of a software part may be seen as a measure of size, which is
known to be associated with errors, i.e., the larger the set of declarations, the more likely
the errors.

0 


Metric 2: Global.

Global(sp) will denote the number of external data declarations visible from a software part
sp.


Assumption A-GL.

The larger the number of external declarations visible in a software part, the larger the 
number of external concepts to be understood and used consistently, the higher the 
likelihood of error. 

0 


Metric 3: Scope. 

Scope(sp) will denote the number of external data declarations for which the data 
declarations of a software part sp are visible.


Assumption A-SC. 

The larger the number of data declarations in the scope of the software part, the larger the
number of contexts of use, the more likely it is to be inadequate to fulfill the needs of the
declarations in the scope. 

0 
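The three declaration counts can be illustrated on a toy design. The sketch below is a minimal example under stated assumptions: visibility is modeled at the level of whole software parts (if a part sees another part, it sees all of its declarations), and the part names, the tables `DECLS` and `SEES`, and the function names are hypothetical.

```python
# Hypothetical design: data declarations owned by each software part.
DECLS = {
    "Stack": {"Item", "Max_Size"},
    "Queue": {"Element"},
    "Main": {"Counter", "Buffer", "Flag"},
}

# SEES[p] = parts whose declarations are visible from p
# (assumption: visibility is granted for a whole part at a time).
SEES = {
    "Stack": set(),
    "Queue": {"Stack"},
    "Main": {"Stack", "Queue"},
}


def local(sp):
    """Local(sp): number of data declarations defined in sp itself."""
    return len(DECLS[sp])


def global_(sp):
    """Global(sp): number of external declarations visible from sp."""
    return sum(len(DECLS[p]) for p in SEES[sp] if p != sp)


def scope(sp):
    """Scope(sp): number of external declarations that can see sp's declarations."""
    return sum(len(DECLS[p]) for p in DECLS if p != sp and sp in SEES[p])


for part in DECLS:
    print(part, local(part), global_(part), scope(part))
```

In this toy design, Main has Local = 3, Global = 3 and Scope = 0, while Stack has Local = 2, Global = 0 and Scope = 4.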


2.2.2 Metrics Based on the USES Relationships 

These metrics capture the dependencies between software parts based on the USES 
relationships of the system. 


Metric 4: Imported Software Parts. 

ISP(sp) will denote the number of software parts imported and used by a software part sp.

Assumption A-ISP.

The larger the number of used external software parts, the larger the context to be
understood, the more likely the occurrence of an error.

0 


Metric 5: Exported Software Parts.

ESP(sp) will denote the number of software parts that use a software part sp. 


Assumption A-ESP. 

The larger the number of contexts of use of a software part, the larger the number of 
services it provides, the more flexible it must be, and, as a consequence, the more likely the 
occurrence of error. 

0 
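Both USES-based counts can be read directly off the USES graph. The following sketch assumes that only direct USES relationships are counted (the definitions above do not say whether indirect use should be included) and uses a hypothetical three-part design.

```python
# USES[p] = software parts that p imports and uses directly (hypothetical design).
USES = {
    "Main": {"Stack", "Queue"},
    "Queue": {"Stack"},
    "Stack": set(),
}


def isp(sp):
    """ISP(sp): number of software parts imported and used by sp."""
    return len(USES[sp])


def esp(sp):
    """ESP(sp): number of software parts that use sp."""
    return sum(1 for other, used in USES.items() if other != sp and sp in used)


for part in USES:
    print(part, isp(part), esp(part))
```

Here Main has ISP = 2 and ESP = 0, whereas Stack has ISP = 0 and ESP = 2, so Stack is the part with the largest number of contexts of use.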


2.2.3 Metrics Based on the IS_COMPONENT_OF Relationships

These metrics capture information about the structure of the IS_COMPONENT_OF graph.


Metric 6: Maximum/Average Depth.

Max_Depth(sp) / Avg_Depth(sp) will denote the maximum/average depth of the nodes
composing a software part sp. 


Assumption A-M/A. 

The larger the depth of a hierarchy, the larger the context information to be known in the 
lower nodes, the more likely the occurrence of error. 

0 


Metric 7: Number of paths.

No_Paths(sp) will denote the number of complete paths (from root to leaf) within a
software part sp. 


Assumption A-NOP. 

The larger the number of paths, the larger the number of parent, sibling, and child 
relationships to be dealt with, the larger the complexity of the hierarchy, the higher the 
likelihood of error occurrence. 

0 
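The depth and path metrics can be sketched for a small hierarchy. Two assumptions are made here that the definitions above leave open: the root is taken to be at depth 0, and the IS_COMPONENT_OF graph is treated as a tree, so that the number of complete paths equals the number of leaves. The node names and the `CHILDREN` table are hypothetical.

```python
# CHILDREN[n] = components of node n in a hypothetical library module hierarchy.
CHILDREN = {
    "Root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": [],
    "A1": [],
    "A2": ["A2x"],
    "A2x": [],
}


def depths(root):
    """Map every node below root (inclusive) to its depth; root is at depth 0."""
    result = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        result[node] = depth
        stack.extend((child, depth + 1) for child in CHILDREN[node])
    return result


def max_depth(root):
    return max(depths(root).values())


def avg_depth(root):
    values = depths(root).values()
    return sum(values) / len(values)


def no_paths(root):
    """Number of root-to-leaf paths; in a tree this is the number of leaves."""
    return sum(1 for node in depths(root) if not CHILDREN[node])


print(max_depth("Root"), avg_depth("Root"), no_paths("Root"))
```

For this hierarchy the node depths are 0, 1, 1, 2, 2 and 3, giving Max_Depth = 3, Avg_Depth = 1.5 and No_Paths = 3.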
2.2.4 Interaction-Based Metrics


In this section, we focus specifically on the dependencies that can propagate inconsistencies
from data declarations to data declarations or subroutines when a new software part is
integrated in a system. Those relationships will be called interactions and will be used to
define metrics capturing cohesion and coupling within and between software parts,
respectively. (Interactions linking subroutines to subroutines or data declarations will not
be considered because they are, in the vast majority of cases, encapsulated in module or
routine bodies and are therefore not detectable in our framework, which only takes into
account high-level design.)

Definition 3: Data declaration-Data declaration (DD) Interaction.

A data declaration A DD-interacts with another data declaration B if a change in A's
declaration or use may cause the need for a change in B's declaration or use.

0 

The DD-interaction relationship is transitive. If A DD-interacts with B, and B DD-interacts
with C, then a change in A may cause a change in C, i.e., A DD-interacts with C. Data
declarations can DD-interact with each other regardless of their location in the designed
system. Therefore, the DD-interaction relationship can link data declarations belonging to
the same software part or to different software parts.

The DD-interaction relationships can be defined in terms of the basic relationships
between data declarations allowed by the language, which represent direct DD-interactions
(i.e., not obtained by virtue of the transitivity of interaction relationships). Data declaration
A directly DD-interacts with data declaration B if A is used in B's declaration or in a
statement where B is assigned a value. As a consequence, as bodies are not available at
high-level design time, we will only consider interactions detectable from the interfaces.

DD-interactions provide a means to represent the dependency relationships between 
individual data declarations. Yet, DD-interactions per se are not able to capture the 
relationships between individual data declarations and subroutines, which are useful to 
understand whether data declarations and subroutines are related to each other and therefore
should be encapsulated into the same module (see Section 2.2.4.1 on cohesion). 

Definition 4: Data declaration-Subroutine (DS) Interaction.

A data declaration DS-interacts with a subroutine if it DD-interacts with at least one of its
data declarations. 

0 





Whenever a data declaration DD-interacts with at least one of the data declarations contained
in a subroutine interface, the DS-interaction relationship between the data declaration and
the subroutine can be detected by examining the high-level design. For instance, from the
Ada-like code fragment in Figure 1, it is apparent that both type T1 and object OBJECT11
DS-interact with procedure SR11, since they both DD-interact with parameter PAR11,
procedure SR11's interface data declaration.


package M1 is
   type T1 is ...;

   OBJECT11, OBJECT12: T1;

   procedure SR11(PAR11: in T1 := OBJECT11);

   package M2 is

      OBJECT13: T1;

      type T2 is array (1..100) of T1;
      OBJECT21: T2;

      procedure SR21(PAR21: in out T2);
   end M2;

   OBJECT22: M2.T2;
end M1;


Figure 1. Program fragment 
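To make Definitions 3 and 4 concrete, the following sketch encodes one reading of the direct DD-interactions in Figure 1 and derives the transitive DD-interactions and the DS-interactions from them. The set `direct` is our interpretation of the fragment (a pair (A, B) means A is used in B's declaration), and the helper name transitive_closure is hypothetical.

```python
from itertools import product

# Direct DD-interactions read off Figure 1: (A, B) means A is used in B's declaration.
direct = {
    ("T1", "OBJECT11"), ("T1", "OBJECT12"), ("T1", "PAR11"),
    ("OBJECT11", "PAR11"),               # OBJECT11 is the default value of PAR11
    ("T1", "OBJECT13"), ("T1", "T2"),
    ("T2", "OBJECT21"), ("T2", "PAR21"), ("T2", "OBJECT22"),
}

# Interface data declarations (formal parameters) of each subroutine.
interface = {"SR11": {"PAR11"}, "SR21": {"PAR21"}}

def transitive_closure(pairs):
    """Add (A, C) whenever (A, B) and (B, C) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if new <= closure:
            return closure
        closure |= new

dd = transitive_closure(direct)
# A DS-interacts with subroutine SR if A DD-interacts with one of SR's interface declarations.
ds = {(a, sr) for (a, b) in dd for sr, params in interface.items() if b in params}
```

With this encoding, T1 and OBJECT11 DS-interact with SR11, as stated above, and T1 also DS-interacts with SR21 through the transitive path T1, T2, PAR21.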

For graphical convenience, both sets of interaction relationships will be represented by 
directed graphs, the DD-interaction graph, and the DS-interaction graph, respectively. In 
both graphs (see Figure 2, which shows DD- and DS-interaction graphs for the code
fragment of Figure 1), data declarations are represented by rounded nodes, subroutines by
thick lined boxes, modules by thin lined boxes, and interactions by arcs. 

Next, we will define high-level design metrics for cohesion and coupling, based on the
above definitions. It is generally acknowledged that system architecture should have low
coupling and high cohesion [CY79]. This is assumed to improve the capability of a system
to be decomposed in highly independent and easy to understand pieces. However, the
reader should bear in mind that high cohesion and low coupling may be conflicting goals,
i.e., a trade-off between the two may exist. For instance, a software system can be made of
small modules with a high degree of internal cohesion but very closely related to each other
and, therefore, with a high level of coupling. Conversely, a software system can be
composed of few large modules, representing its subsystems, loosely related to one
another, i.e., with low coupling, but with a low degree of internal cohesion as well.

Figure 2. DD-interaction (a) and DS-interaction (b) graphs for the program fragment in
Figure 1

Moreover, high cohesion and low coupling are not the only factors to be taken into account 
when designing a software system. Other issues (e.g., potential reuse) must be taken into 
account as well.

2.2.4.1 Cohesion

Cohesion captures the extent to which, in a software part, each group of data declarations
and subroutines that are conceptually related belong to the same module. Based on

• an assumption (A-CH), which provides the rationale to define cohesion metrics;

• the concept of cohesive interactions, i.e., those interactions which contribute to
cohesion;

• a set of properties (Properties 1-3) that cohesion metrics must have in order to
measure cohesion

we now introduce a set of metrics (Metrics 8-11) to measure the degree of cohesion of a
software part.

Assumption A-CH: 

A high degree of cohesion is desirable because information related to declaration and
subroutine dependencies should not be scattered across the system and among irrelevant
information. Data declarations and subroutines which are not related to each other should
be encapsulated to the extent possible into different modules. As a result of such a strategy,
we expect the software parts to be less error-prone.

0 

Consistently with the definition of Abstract Data Type/Object, data declarations and 
subroutines should show some kind of interaction between them, if they are conceptually 
related. Therefore, we are interested in evaluating the tightness of the interactions between
the data declarations and data declarations or subroutines declared in a module interface.
We will capture this by means of cohesive interactions. 

Definition 5: Cohesive Interaction. 

The set of cohesive interactions in a module m, denoted by CI(m), is the union of the sets 
of DS-interactions and DD-interactions, with the exception of those DD-interactions 
between a data declaration and a subroutine formal parameter. 

0 

We do not consider the DD-interactions linking a data declaration to a subroutine parameter
as relevant to cohesion, since they are already accounted for by DS-interactions and we are
interested in evaluating the degree of cohesion between data declarations and routines seen
as a whole. Furthermore, cohesive interactions occur between data declarations and
subroutines belonging to the same module. Interactions across different modules are not
considered cohesive, since cohesion is the extent to which data declarations and
subroutines that are conceptually related belong to the same module. Interactions across
different modules contribute to coupling. Therefore, given a software part sp, the sets of
cohesive interactions of its constituent modules (if any) are disjoint.

Remark. 

It is worth reminding the reader that those relationships that cannot be detected by
inspecting the interfaces, i.e., global variables interacting with subroutine bodies, can
actually be quite relevant to cohesion evaluation, because they often represent the
connections between an object and the subroutines that manipulate it. This issue will be
further discussed later in this section.

0

We base our cohesion metrics for software parts on cohesive interactions. Before defining
them, we introduce the following three properties that they must satisfy in order to match
our assumptions¹.

Property 1: Normalization.

Given a software part sp, the metric cohesion(sp) belongs to a specified interval [0, Max],
and cohesion(sp) = 0 if and only if CI(sp) is empty, and cohesion(sp) = Max if and only if
CI(sp) includes all possible cohesive interactions.

0 

Normalization allows meaningful comparisons between the cohesions of different software 
parts, since they all belong to the same interval, and the extreme values of the cohesion 
range must represent the extreme cases. We will denote by M(sp) the maximal set of
cohesive interactions of the software part sp, i.e., the set that includes all of sp's possible
cohesive interactions, obtained by linking every data declaration to every other data
declaration and subroutine with which it can interact. Some care must be used in defining
M(sp) for languages that allow circular type definitions, such as the ones used to define the
nodes of a linked list. In this case, the declarations of two types T1 and T2 are built in such
a way that T1 interacts with T2 and T2 interacts with T1. We choose to count only one
interaction between them. This is explained by the fact that a single interaction between two
data declarations justifies their encapsulation in a single module/Abstract Data Type.

Property 2: Monotonicity.

Let sp1 be a software part and CI(sp1) its set of cohesive interactions. If sp2 is a modified
version of sp1 with the same sets of data and subroutine declarations and one more
cohesive interaction so that CI(sp2) includes CI(sp1), then cohesion(sp2) ≥ cohesion(sp1).

0 

Adding cohesive interactions to a software part cannot decrease its cohesion.

Property 3: Cohesive Modules. 

Let sp be a software part, and let m1 and m2 be two of its modules. Let sp' be the software
part obtained from sp by merging the declarations belonging to m1 and m2 into a new
module m. If no cohesive interactions exist between the declarations belonging to m1 and
m2 when they are grouped in m, then cohesion(sp) ≥ cohesion(sp').

0 

¹ Properties and metrics can be defined for module sets more general than software parts. However, for
simplicity, we will provide them only for software parts.

Splitting two sets of declarations which are not related to each other via cohesive
interactions into two separate modules cannot decrease the cohesion of the software part.

Based on the properties defined above, we introduce a cohesion metric for software
parts.


Metric 8: Ratio of Cohesive Interactions (RCI) for a Software Part.

The Ratio of Cohesive Interactions for sp is

$$RCI(sp) = \frac{|CI(sp)|}{|M(sp)|} \qquad (*)$$

It is straightforward to prove that RCI(sp) satisfies the above Properties 1-3, and that,
based on Properties 1-3, it is defined on a ratio scale [F91]. Furthermore, RCI(sp) can also
be computed as a weighted sum of the RCI(m)'s of the single modules m belonging to sp.
From Formula (*), since cohesive interactions only occur within modules, but not across
modules,

$$|CI(sp)| = \sum_{m \in sp} |CI(m)|, \qquad |M(sp)| = \sum_{n \in sp} |M(n)|$$

so

$$RCI(sp) = \frac{\sum_{m \in sp} |CI(m)|}{\sum_{n \in sp} |M(n)|}$$

By multiplying and dividing each term in the summation by |M(m)|, we obtain

$$RCI(sp) = \sum_{m \in sp} \frac{|M(m)|}{\sum_{n \in sp} |M(n)|} \cdot \frac{|CI(m)|}{|M(m)|} = \sum_{m \in sp} \frac{|M(m)|}{\sum_{n \in sp} |M(n)|} \, RCI(m)$$

The weights represent the potential contribution of each module m belonging to the
software part sp to the cohesion of the whole sp.
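The equivalence between the direct ratio and the weighted sum can be checked numerically. The sketch below uses hypothetical per-module counts (|CI(m)|, |M(m)|) that are not taken from any of the studied systems, and it assumes every module has |M(m)| > 0.

```python
def rci(ci, m):
    """Ratio of Cohesive Interactions for one module: |CI(m)| / |M(m)|."""
    return ci / m

# Hypothetical (|CI(m)|, |M(m)|) counts for the modules of a software part sp.
modules = {"m1": (4, 7), "m2": (1, 3), "m3": (6, 10)}

total_ci = sum(ci for ci, _ in modules.values())
total_m = sum(m for _, m in modules.values())

# Formula (*): ratio of the summed counts.
rci_direct = total_ci / total_m

# Weighted sum: each module contributes RCI(m) with weight |M(m)| / sum of |M(n)|.
rci_weighted = sum((m / total_m) * rci(ci, m) for ci, m in modules.values())
```

Both expressions evaluate to 11/20 here; the weight of a module grows with the number of cohesive interactions it could potentially contain.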







Figure 3 shows an example of cohesion computation for a single module. T denotes 
a type declaration, C a variable declaration, and SR1, SR2, and SR3 subroutine
declarations. 



RCI = 4/7 = 0.571


Figure 3. Cohesion example 

Based on the above cohesion metric, we can define a threshold for deciding whether a set
of data and subroutines should be kept in one single module or divided into two or more
modules. For simplicity, we will show here only the case in which we have to decide
whether the declarations belonging to a module m should be split into two modules m1 and
m2. This should be the case if the cohesion of the software part consisting of the two
modules m1 and m2 is greater than the cohesion of module m, i.e.,

$$\frac{|CI(m_1)| + |CI(m_2)|}{|M(m_1)| + |M(m_2)|} > \frac{|CI(m_1)| + |CI(m_2)| + |CI_{12}|}{|M(m)|}$$

where |CI12| is the number of cohesive interactions between the declarations belonging to
modules m1 and m2 when they are in module m. Based on the above inequality, we can
define a threshold on |CI12| as follows:

$$\frac{\left(|M(m)| - |M(m_1)| - |M(m_2)|\right)\left(|CI(m_1)| + |CI(m_2)|\right)}{|M(m_1)| + |M(m_2)|} > |CI_{12}|$$
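The split criterion can be sketched as follows, assuming the interaction counts are already known. The function names should_split and ci12_threshold and the numeric values are hypothetical. Exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def should_split(ci1, ci2, ci12, m1_max, m2_max, m_max):
    """True if RCI of the pair {m1, m2} exceeds RCI of the merged module m."""
    merged = Fraction(ci1 + ci2 + ci12, m_max)
    split = Fraction(ci1 + ci2, m1_max + m2_max)
    return split > merged

def ci12_threshold(ci1, ci2, m1_max, m2_max, m_max):
    """Splitting increases cohesion exactly when |CI12| is below this value."""
    return Fraction((m_max - m1_max - m2_max) * (ci1 + ci2), m1_max + m2_max)

# Hypothetical counts: |CI(m1)|=3, |CI(m2)|=2, |M(m1)|=4, |M(m2)|=3, |M(m)|=12.
threshold = ci12_threshold(3, 2, 4, 3, 12)          # 25/7, about 3.57
split_with_3 = should_split(3, 2, 3, 4, 3, 12)      # 3 cross interactions: below threshold
split_with_4 = should_split(3, 2, 4, 4, 3, 12)      # 4 cross interactions: above threshold
```

With three cross interactions the split raises cohesion (5/7 against 8/12); with four it does not (5/7 against 9/12), which agrees with the threshold of 25/7.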

We want to emphasize, however, that, since cohesion is not the only characteristic relevant
to software design, its increase should not be used as the only criterion on which to base
such a decision.

The Role of Additional Information

Additional information to what is visible in the interfaces may be available at the end of 
high-level design. For instance, given the interface of a module m, the designers have at 
least a rough idea of which objects declared in m will be manipulated by a subroutine in 
m's interface. It will be left to the person responsible for the metric program to decide 
whether or not it is worth collecting this kind of information, thus making the designer 
describe which objects will be accessed by which subroutines. Formatted comments may
be a convenient way of conveying this information through module interfaces and therefore
of automating the collection of this type of information.

For instance, from the code fragment in Figure 1, we cannot tell whether
OBJECT12 DS-interacts (as a global variable) with subroutine SR11. In this case,
designers can answer in three different ways:

(1) OBJECT12 DS-interacts with SR11

(2) OBJECT12 does not DS-interact with SR11

(3) the information they have is not sufficient

It is worth saying that answers of kind (2) provide valuable, though negative, information
on the DS-interactions present in a system. For instance, in the code fragment in Figure 1,
the designer may indicate the existence of a DD-interaction between object OBJECT12 and
PAR11 and the lack of interaction between OBJECT13 and PAR21. As a consequence, the
computation of cohesion is affected. If we take into account this additional information, 
other alternative cohesion metrics can be defined. 

Given a software part sp, and a pair <A,B>, where A is a data declaration and B is 
either a data declaration or a subroutine, we will say that the interaction between them is 
known if it is detectable from the high-level design or is signaled by the designers (they 
provide an answer similar to answer (1) above); we will say that the interaction between 
them is unknown if it is not detectable from the high-level design and is not signaled by the 
designers (they provide an answer similar to answer (3) above).

The set of known interactions of a software part sp will be denoted by K(sp), and
the set of unknown interactions by U(sp). In general, |M(sp)| ≥ |K(sp)| + |U(sp)|, since
some interactions are not detectable from the high-level design and the designers explicitly
exclude their existence.

Metric 9: Neutral Ratio of Cohesive Interactions (NRCI).

Unknown CIs are not taken into account.

$$NRCI(sp) = \frac{|K(sp)|}{|M(sp)| - |U(sp)|}$$

Metric 10: Pessimistic Ratio of Cohesive Interactions (PRCI).

Unknown CIs are considered as if they were known not to be actual interactions.

$$PRCI(sp) = \frac{|K(sp)|}{|M(sp)|}$$

(This is equivalent to RCI(sp).)

Metric 11: Optimistic Ratio of Cohesive Interactions (ORCI).

Unknown CIs are considered as if they were known to be actual interactions.

$$ORCI(sp) = \frac{|K(sp)| + |U(sp)|}{|M(sp)|}$$

The above three metrics satisfy Properties 1-3, where CI(sp) is replaced by
K(sp) ∪ U(sp).

If PRCI(sp), NRCI(sp), and ORCI(sp) are all defined, it can be shown that,
for all software parts sp,

$$0 \le PRCI(sp) \le NRCI(sp) \le ORCI(sp) \le 1$$

ORCI(sp) and PRCI(sp) provide the bounds of the admissible range for cohesion, and
NRCI(sp) takes a value in between. It can also be shown that the smaller the number of
unknown interactions, the smaller the interval [PRCI, ORCI], i.e., the more complete the
information, the narrower the uncertainty interval. It should be noted that, once the
low-level design is completed, accurate and complete information about cohesive interactions
should be available.

Remark. 

NRCI(sp) is undefined if and only if all interactions are unknown, i.e., no information is
available on cohesive interactions. It is interesting to notice that in this case, and only in this
case, PRCI(sp) = 0 and ORCI(sp) = 1, i.e., PRCI(sp) and ORCI(sp) do not provide
stricter bounds than the ones provided by the interval for cohesion. The fact that NRCI(sp)
is undefined can be interpreted as the possibility that NRCI(sp) can take any value in the
interval [0,1].
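The three ratios and the degenerate case of the Remark can be sketched as below. The function name cohesion_bounds and the counts are hypothetical. Returning None for an undefined NRCI is our choice for this illustration, not part of the metric definitions.

```python
def cohesion_bounds(known, unknown, maximal):
    """Return (PRCI, NRCI, ORCI) from |K(sp)|, |U(sp)|, and |M(sp)|.

    NRCI is None when every possible interaction is unknown (|U| = |M|).
    """
    if maximal <= 0 or known < 0 or unknown < 0 or known + unknown > maximal:
        raise ValueError("expected |M| > 0 and |K| + |U| <= |M|")
    prci = known / maximal                      # unknown treated as absent
    orci = (known + unknown) / maximal          # unknown treated as present
    nrci = known / (maximal - unknown) if unknown < maximal else None
    return prci, nrci, orci

# Hypothetical counts: 4 known, 2 unknown, 10 possible cohesive interactions.
prci, nrci, orci = cohesion_bounds(4, 2, 10)

# Degenerate case from the Remark: all interactions are unknown.
prci_all, nrci_all, orci_all = cohesion_bounds(0, 10, 10)
```

In the first case the three values are ordered as PRCI, NRCI, ORCI (0.4, 0.5, 0.6); in the second the bounds widen to the full interval [0,1].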

2.2.4.2 Coupling

In this section, we first give general definitions and assumptions on coupling; then we
present a set of metrics and discuss the issue of genericity in the context of coupling. To
address the particular issue of coupling, we will refer to the import interactions of a module 
m as all interactions going from a declaration outside m to a declaration inside m. Similarly, 
we define export interactions as going from a declaration located inside m to a declaration 
outside m. 

Assumption A-IC: 

The more dependent a software part is on external data declarations, the more external
information needs to be known in order to make the software part consistent with the rest 
of the system. In other words, the larger the amount of external data declarations, the more 
incomplete the local description of the software part interface, the more spread the 
information necessary to integrate a software part in a system. Thus, the software part 
becomes more error-prone. 

0 


Definition 6: Import Coupling of a software part (IC). 

Import Coupling is the extent to which a software part depends on imported external data 
declarations. 

0

Assumption A-EC:

Export coupling is related to how a software part is used in the system. The more often the 
software part is used, the larger the number of services it has to provide, the more flexible 
it needs to be, e.g., generic module. This may lead to errors.

0 


Definition 7: Export Coupling of a software part (EC).

Export coupling is the extent to which the data declarations of a software part affect the data
declarations of the other software parts in the system.

0 

Import and export coupling of a software part will be expressed in terms of the actual
DD-interactions involving imported external data declarations and the internal data declarations
of the software part. We now provide properties that must be satisfied by both import and
export coupling metrics.

Property 4: Non-negativity

Given a software part sp, the metric import_coupling(sp) ≥ 0 (resp. export_coupling(sp) ≥
0). import_coupling(sp) = 0 (resp. export_coupling(sp) = 0) if and only if sp does not have
import (resp. export) interactions with other software parts.

0 


Property 5: Monotonicity 

Let m1 be a module and II(m1) (resp. EI(m1)) its set of import (resp. export) interactions.
If m2 is a modified version of m1 with the same sets of data and subroutine declarations
and one more import (resp. export) interaction so that II(m2) (resp. EI(m2)) includes
II(m1) (resp. EI(m1)), then import_coupling(m2) ≥ import_coupling(m1) (resp.
export_coupling(m2) ≥ export_coupling(m1)).

0 

Adding import (resp. export) interactions to a module cannot decrease its import (resp.
export) coupling. 

Property 6: Merging of Modules 

The sum of the couplings of two modules is no less than the coupling of the module which 
is composed of the data declarations of the two modules. 

0

This stems from the fact that two modules may contain interactions between each other's
declarations, which are no longer import or export interactions for the module resulting
from merging the original modules.

It should be noted that, as opposed to cohesion, coupling is not a normalized
metric. This comes from assumptions A-CH, A-IC, and A-EC (see Sections 2.2.4.1 and
2.2.4.2), where we state that cohesion is a degree of interdependence within a software
part, whereas coupling is a quantity of dependencies between a software part and the rest of
the system.

We will now introduce interaction-based coupling metrics. The issue will be first
addressed by ignoring generic modules for the sake of simplification. Generic modules and
their impact on the defined metrics will be treated later in this section.


Metric 12: Import Coupling 

Given a software part sp, Import Coupling of sp (denoted by IC(sp)) is the number of
DD-interactions between data declarations external to sp and the data declarations within sp.


Metric 13: Export Coupling

Given a software part sp, Export Coupling of sp (denoted by EC(sp)) is the number of
DD-interactions between the data declarations within sp and the data declarations external to sp.


It is straightforward to prove that IC(sp) and EC(sp) satisfy the above properties 4-6, and
that, based on properties 4-6, these metrics are defined on a ratio scale [F91]. 

Each box in Figure 4 represents a module interface. Module interfaces m2 and m3
are located in their parent's interface m1. m2 is assumed to be declared before m3 and
therefore visible to m3. Tij and OBJECTij data declarations represent respectively types and
objects in module mi. FP3 represents a subroutine formal parameter. The IC and EC values
for the modules in Figure 4 are computed as follows.

IC(m1) = 0    EC(m1) = 10

IC(m2) = 4    EC(m2) = 1

IC(m3) = 5    EC(m3) = 0

IC(m4) = 2    EC(m4) = 0

In the example of Figure 4, we see that m1 expectedly shows the largest export coupling.

Figure 4. Calculation of IC and EC with non-generic modules only

Based on the definitions of IC(sp) and EC(sp), we derive four related metrics, DIC(sp)
(Direct Import Coupling), TIC(sp) (Transitive Import Coupling), DEC(sp) (Direct Export
Coupling), TEC(sp) (Transitive Export Coupling). DIC(sp) and DEC(sp) only take into
account direct interactions, whereas TIC(sp) and TEC(sp) only take into account transitive
interactions. By their definitions, IC(sp) = DIC(sp) + TIC(sp), and
EC(sp) = DEC(sp) + TEC(sp). This allows us to separately evaluate the impact of direct
and transitive interactions on error-proneness, as we show in the experimental validation.
In practice, the number of transitive interactions turns out to be much bigger than that of
direct interactions, so IC(sp) ≈ TIC(sp) and EC(sp) ≈ TEC(sp).
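A minimal sketch of the six coupling counts follows. It assumes a flat set of software parts (no nesting), a set of direct DD-interactions given as (source, target) pairs, and a map from each declaration to the part that owns it. An interaction is counted as transitive only if it is in the transitive closure but not in the direct set. The names coupling_metrics, transitive_closure, and the sample declarations are hypothetical.

```python
def transitive_closure(pairs):
    """Close a set of (source, target) DD-interactions under transitivity."""
    closure = set(pairs)
    changed = True
    while changed:
        new = {(a, d) for a, b in closure for c, d in closure if b == c}
        changed = not new <= closure
        closure |= new
    return closure

def coupling_metrics(direct, owner, sp):
    """Count import/export DD-interactions of sp, split into direct and transitive."""
    direct = set(direct)
    closure = transitive_closure(direct)
    imports = {(a, b) for a, b in closure if owner[a] != sp and owner[b] == sp}
    exports = {(a, b) for a, b in closure if owner[a] == sp and owner[b] != sp}
    dic, dec = len(imports & direct), len(exports & direct)
    return {"IC": len(imports), "DIC": dic, "TIC": len(imports) - dic,
            "EC": len(exports), "DEC": dec, "TEC": len(exports) - dec}

# Hypothetical system: T1 (in m1) is used by T2 (in m2), which is used by OBJ3 (in m3).
owner = {"T1": "m1", "T2": "m2", "OBJ3": "m3"}
direct = {("T1", "T2"), ("T2", "OBJ3")}
m1 = coupling_metrics(direct, owner, "m1")
m3 = coupling_metrics(direct, owner, "m3")
```

In this example m1 only exports (one direct and one transitive interaction) and m3 only imports, so IC = DIC + TIC and EC = DEC + TEC hold by construction.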

The Treatment of Generic Modules 

There are two possible ways of taking into account generics when calculating coupling.
Either each instance can be seen as a different module or a generic can be seen as any other
module whose scope/global data declarations is/are the union of the scope/global data
declarations of its instances. The second solution does not consider instances as
independent modules and appears to be more suitable to our specific perspective, since
errors are to be found in generics and, only as a consequence, in instances.

The import coupling of a generic module is the cardinality of the union of the sets of
DD-interactions between the data declarations in the software system and those of each of
its instances. When calculating export coupling, we take into account the DD-interactions
between the data declarations of each of its instances and those of the software system.
Consistent with the definition of DD-interaction, generic formal parameters DD-interact
with their particular generic actual parameters (i.e., type, object) when the generic module is
instantiated, since a change in the former may imply a change in the latter.

This is what the example in Figure 5 illustrates. Gen_m is the interface of a generic
module, with a generic formal parameter GenFP and a generic type GenT. The export
coupling of module Gen_m is given by the sum of three parts:

• two interactions from Gen_m to m1, due to the two instantiations, Gen_m(1) and
Gen_m(2), of Gen_m in m1;

• the interaction from the instantiation Gen_m(1);

• the two interactions from the instantiation Gen_m(2).

IC(m1) = 2    EC(m1) = 4

IC(m2) = 3    EC(m2) = 0

IC(m3) = 4    EC(m3) = 0

IC(Gen_m) = 0    EC(Gen_m) = 5



Figure 5. Generics when calculating coupling

3 Experimental Validation


The experimental validation has two main goals. 

Goal 1

We want to find out which of the metrics defined above have a significant impact on the
error-proneness of software parts. This allows us to

a. prove that high-level design information can be used to build significant indicators
of software error-proneness

b. determine which of our assumptions about the development process (Section 2) are
experimentally supported

c. compare the four strategies for defining high-level design metrics

d. identify the most promising research directions.

0 

Goal 2

We need to investigate dependencies between metrics, in order to determine which ones are 
complementary, and can be used in combination, and which ones capture similar 
phenomena. 

0 

Section 3.1 presents the experimental design of the analysis, the project data sets used and
the tool built to capture the discussed design metrics. Section 3.2 provides and discusses
the results of a univariate analysis of the metrics. The significance of the metrics as
predictors of error-prone software parts is assessed and the differences between systems
are investigated. Section 3.3 investigates the results obtained when building multivariate 
classification models for detecting error-prone LMHs based on significant design metrics. 
The model results are assessed and the model functional structure is investigated. 

3.1 Experiment Design

Experiment Layout

In order to validate software measurement assumptions experimentally, one can adopt two
main strategies: (1) small-scale controlled experiments, (2) real-scale industrial case
studies. In this research project, we chose the second alternative since we thought the
phenomena we are studying would be even more visible and significant on software
systems of realistic size and complexity. Also, we thought that (2) should be a more
relevant and convincing validation for the software industry practitioners.

However, the problem in such studies is that it becomes difficult to study the 
phenomena of interest in isolation, without having to deal with other sources of variation. 
In our case, we thought that, if these metrics were to be interesting, they should explain a 
significant percentage of the variation individually or in combination, despite other sources 
of variation. However, we expect some degree of variation across projects.

Environment 

The first system studied is an attitude ground support software for satellites (GOADA) 
developed at the NASA Goddard Space Flight Center. The second one (GOESIM) is a 
dynamic simulator for a geostationary environmental satellite. These systems are composed 
^76 Ada units, 90 Klocs and 170 Klocs, respectively, and have a fairly small
reuse rate (around 5% of source code lines). The third system we studied (TONS) is an
onboard navigation system for satellites that has been developed in the same environment
and is about 180 Ada units and 50 Klocs large, with an extremely small rate of reuse (2%
of source code lines). We selected projects with lower rates of reuse in order to make our
analysis of design factors more straightforward by removing what we think is a major
source of noise in this context.

Tool 

A tool analyzing the interface parts of Ada source code has been developed in order to 
capture the design characteristics of these systems. This tool is based on LEX&YACC 
[f-'Y92] and extracts generic high-level design information about the visibility and
interactions of the system declarations. This information is consequently used to compute
the metrics presented in Section 2.2, and others that might be defined.

Analytical Model 

The response variable we use to validate the design metrics is binary, i.e., Did an error not
occur in an LMH? In order to analyze the impact of software metrics on the error-proneness
of software parts, we used logistic regression, a classification technique [HL89] used in
many experimental sciences, based on maximum likelihood estimation, and presented
below. In this case, a careful outlier analysis must be performed in order to make sure that
the observed trend is not the result of few observations [DG84]². In particular, we first
used univariate logistic regression, to evaluate the impact of each of the metrics in isolation
on error-proneness. Then, we performed multivariate logistic regression, to evaluate the
relative impact of those metrics that had been assessed sufficiently significant in the
univariate analysis (e.g., α < 0.20 is a reasonable heuristic). This modeling process is
further described in [HL89].

A multivariate logistic regression model is based on the following relationship
equation (the univariate logistic regression model is a special case of this, where only one
variable appears):

$$\log\left(\frac{p}{1-p}\right) = C_0 + C_1 X_1 + C_2 X_2 + \dots + C_n X_n$$

where p is the probability that no errors were found in a software part during the validation
phase, and the Xi's are the design metrics included as predictors in the model (called
covariates of the logistic regression equation). In the two extreme cases, i.e., when a
variable is either non-significant or entirely differentiates error-prone software parts, the
curve (between p and any single Xi, i.e., assuming that all other Xj's are constant)
approximates a horizontal line and a vertical line respectively. In between, the curve takes a
flexible S shape. However, since p is unknown, the coefficients Ci will be estimated
through a likelihood function optimization. This procedure assumes that all observations
are statistically independent. When building the regression equations, each observation was
weighted according to the number of errors detected in each software part. The rationale is
that each (non) detection of error is considered as an independent event. As a consequence,
software parts where no errors were detected were weighted 1.
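The model equation and the weighting scheme can be sketched as follows. This is an illustration, not the procedure actually used in the study: the function names are hypothetical, and the weighting (one unit for an error-free part, one unit per detected error otherwise) is our reading of the paragraph above. Maximizing the returned value over the coefficients would give the estimates; the optimization step itself is omitted.

```python
import math

def prob_no_error(coeffs, xs):
    """p from log(p / (1 - p)) = C0 + C1*X1 + ... + Cn*Xn, with coeffs = [C0, C1, ..., Cn]."""
    z = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-z))

def weighted_log_likelihood(coeffs, rows):
    """rows holds (metric values, number of errors detected) for each software part."""
    total = 0.0
    for xs, n_errors in rows:
        p = prob_no_error(coeffs, xs)
        if n_errors == 0:
            total += math.log(p)                    # error-free part, weight 1
        else:
            total += n_errors * math.log(1.0 - p)   # one event per detected error
    return total

# Hypothetical univariate model with C0 = 1.0 and C1 = -0.5, evaluated at X1 = 4.
p = prob_no_error([1.0, -0.5], [4])
log_odds = math.log(p / (1.0 - p))
```

Inverting the logit recovers the linear predictor, here 1.0 - 0.5 * 4 = -1.0, which is a quick way to check that the probability function matches the model equation.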

Goodness-of-fit for such a model is assessed via a statistic called R² (because
similar in concept to the least-square regression coefficient of determination), belonging to
the interval [0,1]. The higher R², the more accurate the model. However, as opposed to the
R² of least-square regression, high R²s are rare for logistic regression, for reasons whose
explanation is well beyond the scope of this text. The interested reader may refer to [HL89]
for a detailed introduction to logistic regression.

Tables 1 and 2 contain the results we obtained through, respectively, univariate and
multivariate logistic regression on the three systems. We report those related to the metrics
that turned out to be the most significant ones across all three projects. For each metric, we
provide the following statistics:

² In order to confirm the obtained results, we used non-parametric tests for rank distributions such
as the Mann-Whitney U test [CAP88]. Results appeared to be consistent across techniques and, in
order to limit the amount of statistics provided to the reader and preserve the clarity of the text, we only
show the results obtained with logistic regression.

• C (appearing in both tables), the estimated regression coefficient. The larger the
coefficient (in absolute value), the stronger the impact of the explanatory variable on
the probability p.

• Δψ (appearing in Table 1 only), which is based on the notion of odds ratio [HL89]
and provides an evaluation of the impact of the metric on the dependent variable.
More specifically, the odds ratio ψ(X) represents the ratio between the probability of
not having an error and the probability of having an error when the value of the
metric is X. As an example, if, for a given value X, ψ(X) is 2, then it is twice as
likely that the software part does not contain errors as that it does contain errors.
The value of Δψ is computed by means of the following formula:

$$\Delta\psi = \frac{\psi(X+1)}{\psi(X)}$$

Therefore, Δψ represents the reduction/increase in the odds ratio when the value X
increases by 1 unit. This provides a more intuitive insight than regression
coefficients into the impact of explanatory variables. (Since the whole range of RCI
is [0,1], we used one-tenth as the quantum for RCI increase with respect to which
Δψ is computed.)

• α (appearing in both tables), the level of significance, which provides an insight
into the accuracy of the coefficient estimates. The significance (α) of the logistic
regression coefficients tells the reader about the probability for the coefficient to be
different from zero by chance. Also, the larger the level of significance, the larger
the standard deviation of the estimated coefficients and the less believable the calculated
impact of the coefficient. The significance test is based on a likelihood ratio test
[HL89] commonly used in the framework of logistic regression.
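The relation between C and Δψ can be made concrete with a short Python sketch. It assumes
the logistic model has the standard form log(p / (1 - p)) = C_0 + sum of C_i * X_i, in which
case ψ(X) = exp(C_0 + sum of C_i * X_i) and therefore Δψ = ψ(X+1) / ψ(X) = exp(C). The
function name is hypothetical; the coefficients are the ISP values reported in Table 1.

```python
import math

def odds_ratio_change(coefficient):
    """Delta-psi: factor by which the odds p/(1-p) change when X grows by 1 unit."""
    return math.exp(coefficient)

# ISP coefficients reported in Table 1 for the three projects.
for project, c in [("GOADA", -0.8), ("GOESIM", -0.717), ("TONS", -0.96)]:
    print(f"{project}: C = {c}, delta-psi = {odds_ratio_change(c):.0%}")
```

Under this assumption the sketch prints 45%, 49%, and 38%, which agree with the Δψ values
listed for ISP in Table 1.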

3.2 Univariate Analysis


Results 

As Table 1 shows, all strategies presented in Section 2.2 provide significant metrics, except
the strategy based on declaration counts. Therefore, these metrics, although based on
simple and appealing concepts, do not appear to be significant predictors.

All the metrics based on exported declarations, i.e., Local(sp), ESP(sp), EC(sp),
DEC(sp), and TEC(sp), are not significant. Our explanation is that when an inconsistency
exists between an exporting module E and an importing module I, I is more likely to be
corrected, since E may export to other modules. Changing E is likely to require changing
those other modules. Alternatively, a large amount of exports sometimes translates into a
need for genericity but, for many declarations, just results in additional fields and
dimensions. Therefore, the assumption underlying the export interactions metric appears
somewhat questionable.

All the metrics based on the IS_COMPONENT_OF relation appear significant in
the univariate analysis. However, they show a strong multicollinearity (i.e., the linear
correlations are strong between metrics). Since Avg_depth is the best predictor in its
category and in order to minimize the size of Table 1, only the Avg_depth results are
shown.

A close analysis of the correlation matrix of the studied metrics shows that these
results are not due to strong correlations between factors, e.g., when all factors are joint
predictors. Therefore, all the metrics in Table 1 seem to capture not only significant but
different trends. This shows that most of the strategies are likely to be complementary and
useful. This is confirmed by the multivariate results presented in Section 3.3.



                           GOADA                     GOESIM                    TONS
Strategy  Metrics      C       Δψ      α         C       Δψ      α         C       Δψ      α
USES      ISP         -0.8     45%     0.000    -0.717   49%     0.002    -0.96    38%     0.000
i_c_o     Avg_Depth   -2.27    11%     0.000    -2.4     9%      0.000    -3.9     2%      0.000
Inter.    RCI          0.63    188%    0.000     0.215   124%    0.047     0.34    141%    0.001
Inter.    TIC         -0.016   98.5%   0.001    -0.017   98.3%   0.002    -0.02    98%     0.15
Inter.    DIC         -0.23    79%     0.000    -0.19    83%     0.001    -0.04    96%     0.19

Table 1. Univariate Analysis

Detailed Discussion

TIC and DIC do not appear to be significant in TONS (α = 0.19 and 0.15, respectively),
whereas they are very significant in the two other systems. The analysis of the distribution
of these factors in all three systems shows that their standard deviation (σ)
and median (m) are much smaller in TONS, i.e., with respect to TIC, σ = 10, m = 2.5 for
TONS versus σ = 32.74, m = 15.5 for GOADA and σ = 32.18, m = 59 for GOESIM. As a
consequence, any trend related to either DIC or TIC is very likely not to be visible in the
TONS dataset. When considering that TONS is a significantly smaller system than the two
other ones, results may be interpreted as follows: the distribution of import interactions is
strongly dependent on the size of the system, and import interaction metrics are likely to be
mediocre predictors for small systems.

Comparing Models 

From a more general perspective, variations across models (i.e., univariate regression
equations) should be expected due to differences in project characteristics, e.g., size and
application domain. However, it is worth noticing that, despite the fact that these projects
belong to different application domains (within the context of satellite support systems) and
have been developed at different times, most of the models are surprisingly stable across
projects. Because of the functional shape of logistic models, coefficients that may appear
significantly different actually generate very similar models, e.g., in Table 1, coefficients
-2.27 and -3.9 yield Δψs of 11% and 2%, respectively. As a consequence, to evaluate the
stability of the models, the reader should rather look at the Δψ column in Table 1. When
doing so, only RCI appears to have a noticeable model instability even though the trends
are consistent.


3.3 Multivariate Models 

In this section, we present the results obtained by performing a stepwise multivariate
logistic regression. Table 2 provides the estimated regression coefficients (C) and their
significance (α) based on a Wald test [HL89], which is obtained by comparing the
maximum likelihood estimate of a parameter to its estimated standard deviation. Regression
coefficients are not shown when their level of significance is above 0.2 (substituted by a
*).
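To make the Wald test concrete, the following Python sketch divides a coefficient estimate by
its standard error and converts the result into a two-sided significance level. Referring the
statistic to a standard normal distribution is the usual textbook convention and is an
assumption here; the function name and the standard error in the example are hypothetical,
since the paper does not report standard errors.

```python
import math

def wald_test(coefficient, std_error):
    """Return the Wald statistic and its two-sided p-value (standard normal)."""
    z = coefficient / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical example: the standard error is invented for illustration only.
z, alpha = wald_test(-0.9, 0.44)
print(f"z = {z:.2f}, significance = {alpha:.3f}")
```

With the reporting rule used in Table 2, a coefficient whose significance level computed this
way exceeds 0.2 would be replaced by an asterisk.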
                           GOADA             GOESIM            TONS
Strategy  Metrics      C        α        C        α        C        α
USES      ISP         -0.9     0.04      *        *       -1.18    0.000
i_c_o     Avg_Depth   -1.8     0.003    -3.12    0.000    -5.62    0.000
Inter.    RCI          0.4     0.006     0.3     0.07      0.2     0.16
Inter.    TIC         -0.023   0.000    -0.02    0.005     *        *
Inter.    DIC          0.23    0.04     -0.13    0.04     -0.11    0.002

Table 2. Coefficients of Multivariate Models


Results 

The very low levels of significance in Table 2 suggest that these metrics may be used in
combination as indicators of error-prone LMHs. Indeed, when used in a multivariate
model, many of these metrics are still significant and produce models that are more accurate
than univariate models (Table 2). The best univariate R^2s are 0.115, 0.20, and 0.16 for
GOADA, GOESIM, and TONS, respectively. In the same order, the multivariate R^2s are
0.21, 0.24, and 0.43. We can see that the results improved very significantly for GOADA
and TONS.

Interaction-based metrics are more complex but worth collecting, since they are the
only metrics defined at the declaration level that appeared significant. In addition, the
average LMH depth was consistently selected as a very good indicator. This is likely to be
an early measure of "size" of the LMH and is expectedly significant. Also, ISP, a metric
similar to the notion of fan-in, proves to be significant across projects (except in the
multivariate GOESIM model for reasons explained below), while ESP, the equivalent
measure for exports (based on the fan-out of LMHs), is not significant. As a consequence, a
metric of the form (fan_in * fan_out)^2, suggested on numerous occasions in the literature
[HK84, IS88, S90, Z91], does not appear to be significant. From a more general
perspective, metrics based on imports, regardless of the associated concepts, appear to
predict more accurately the error-proneness of software parts.

Comparing Models 

Some variability in the estimated regression coefficients can be observed across
projects in Table 2 and requires some discussion. In multivariate models, coefficients have
a tendency to adjust, statistically, for other variables [HL89]. Sometimes, variables are
weak predictors of the response variable when taken individually, and become more
significant when integrated in a multivariate model. In Table 1, DIC showed, for TONS, a
mediocre level of significance, whereas it appears to be a significant predictor in Table 2.
Moreover, its coefficient is very unstable across projects and the trend is reversed (positive)
for GOADA and TONS. When looking more carefully at the associations (not only the
narrower concept of linear correlation) between metrics, it can be determined that this is the
result of a strong association between DIC and ISP in GOADA and TONS. These
associations are a typical source of coefficient instability [DG84], e.g., the coefficient of
ISP in GOADA varies from -0.9 to -0.39 when DIC is removed from the equation.

TIC remains non-significant because of its strong linear correlation (R^2 = 0.76)
with DIC in the TONS dataset. Similarly, ISP does not appear significant in the GOESIM
model because of a strong correlation with DIC (R^2 = 0.50). RCI in TONS shows a
weaker significance (α = 0.16) than in the univariate results, and no strong linear correlation
can be observed with the other metrics included in the multivariate equation. However,
LMHs with large numbers of imported interactions are all located in the low part of the
cohesion range. Such an association (likely to be spurious since it is not the case in the
other datasets) with DIC is likely to affect the significance of RCI in a multivariate
equation.
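The linear correlations quoted above can be checked with a few lines of Python. The sketch
below computes the squared Pearson correlation between two metric vectors; the function name
and the sample values are hypothetical and do not come from the GOADA, GOESIM, or TONS
datasets. As the text points out, linear correlation is only one, fairly narrow, form of
association, so a low value does not rule out other dependencies between metrics.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long metric vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical TIC and DIC values for six LMHs (invented for illustration).
tic = [2.0, 5.0, 9.0, 14.0, 20.0, 31.0]
dic = [1.0, 2.0, 2.0, 5.0, 6.0, 9.0]
print(f"R^2 between TIC and DIC: {r_squared(tic, dic):.2f}")
```

A value close to 1 for a pair of candidate covariates would signal that their coefficients
may be unstable when both are entered in the same multivariate equation.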

It is important to note that a different set of systems showing different distributions
might show very different trends. This points out a need for large-scale investigation across
various development environments and application domains.


4 Conclusion 

This study has shown that statistical models of extremely good significance can be built
based on high-level design information. In particular, we have determined accurate early
predictors for error-prone software. Moreover, the results suggested that, at this stage of
understanding, several strategies were worth investigating because none of them showed
dominant trends, while most of them appeared to be complementary. In order to provide
the practitioner with usable, well understood, and validated models, software engineering
researchers will have to keep refining and validating the existing metrics. There is still
substantial room for improvement.

The stability of the impact of these metrics across projects allows us to draw
optimistic conclusions about the use of such quality indicators. Using early quality
indicators based on objective empirical evidence appears possible. However, it is very
likely that this kind of indicator will behave differently across various domains of
application and development environments.

Therefore, the use of such indicators should always be preceded by a careful 
empirical analysis of local error patterns and a thorough comparison across projects. 

Our future work will be three-fold: 

• Analyze more systems 

• Validate further and refine the metrics we defined in this paper. The variations
across environments and the study/comparison of different architectures are likely to
give us interesting insights.

• Consistent with our current objectives, we will address the issues related to
building metric-based empirical models earlier in the life cycle. In particular, the
next stage of this research will focus on defining and validating metrics for formal
specifications.

SECTION 3— TECHNOLOGY EVALUATIONS


The technical papers included in this section were originally prepared as indicated below. 

• Comparing Detection Methods for Software Requirements Inspections: A Repli- 
cated Experiment, A. A. Porter, L. G. Votta Jr., and V. R. Basili, University of 
Maryland, Technical Report TR-3327, July 1994 

• “Software Process Evolution at the SEL,” V. Basili, S. Green, IEEE Software, 
July 1994 

Comparing Detection Methods For Software Requirements
Inspections: A Replicated Experiment

Adam A. Porter Lawrence G. Votta, Jr. Victor R. Basili*


Abstract

Software requirements specifications (SRS) are usually validated by inspections, in which several reviewers
independently analyze all or part of the specification and search for defects. These defects are then collected at a
meeting of the reviewers and author(s).

Usually, reviewers use Ad Hoc or Checklist methods to uncover defects. These methods force all reviewers to
rely on nonsystematic techniques to search for a wide variety of defects. We hypothesize that a Scenario-based
method, in which each reviewer uses different, systematic techniques to search for different, specific classes of
defects, will have a significantly higher success rate.

We evaluated this hypothesis using a 3 x 2^4 partial factorial, randomized experimental design. Forty-eight
graduate students in computer science participated in the experiment. They were assembled into sixteen three-
person teams. Each team inspected two SRS using some combination of Ad Hoc, Checklist, or Scenario methods.

For each inspection we performed four measurements: (1) individual defect detection rate, (2) team defect
detection rate, (3) percentage of defects first identified at the collection meeting (meeting gain rate), and (4)
percentage of defects first identified by an individual, but never reported at the collection meeting (meeting loss
rate).

The experimental results show that (1) the Scenario method has a higher defect detection rate than either Ad
Hoc or Checklist methods, (2) Scenario reviewers are more effective at detecting the defects their scenarios are
designed to uncover, and are no less effective at detecting other defects, (3) Checklist reviewers were no more
effective than Ad Hoc reviewers, and (4) Collection meetings produce no net improvement in the defect detection
rate - meeting gains are offset by meeting losses.


A preliminary version of this article, entitled "An Experiment to Assess Different Defect Detection Methods For
Software Requirements Inspections", has been selected to appear in the proceedings of the 16th International Conference
on Software Engineering. This article expands on our previous work in several ways:

1. We have replicated the initial experiment - doubling the number of data points.

2. We have expanded the description of the Scenario detection methods and included appendices containing the full
text of the Ad Hoc, Checklist, and Scenario defect detection aids that were used during the experiment.

3. Our original analysis examined the effect of different detection methods on team performance. With the increased
number of data points, we are now able to extend the analysis to determine how these methods influence individual
performance. This allows us to reject several threats to the experiment’s internal validity.

4. We have added a new section analyzing how inspection meetings affect inspection performance. Our results
show that meetings contribute nothing to defect detection effectiveness.


*This work is supported in part by the National Aeronautics and Space Administration under grant NSG-5123. Porter and Basili
are with the Department of Computer Science, University of Maryland, College Park, Maryland 20742. Votta is with the Software
Production Research Department, AT&T Bell Laboratories, Naperville, IL 60566.

1 Introduction


One of the most common ways of validating a software requirements specification (SRS) is to submit it to an
inspection by a team of reviewers. Many organizations use a three-step inspection procedure for eliminating
defects^1: detection, collection, and repair^2 [8, 17]. A team of reviewers reads the SRS, identifying as many
defects as possible. Newly identified defects are collected, usually at a team meeting, and then sent to the
document’s authors for repair.

We are focusing on the methods used to perform the first step in this process, defect detection. For this 
article, we define a defect detection method to be a set of defect detection techniques coupled with an assignment 
of responsibilities to individual reviewers. 

Defect detection techniques may range in prescriptiveness from intuitive, nonsystematic procedures, such as
Ad Hoc or Checklist techniques, to explicit and highly systematic procedures, such as formal proofs of correctness.

A reviewer’s individual responsibility may be general - to identify as many defects as possible - or specific -
to focus on a limited set of issues such as ensuring appropriate use of hardware interfaces, identifying untestable
requirements, or checking conformity to coding standards.

These individual responsibilities may be coordinated among the members of a review team. When they are 
not coordinated, all reviewers have identical responsibilities. In contrast, the reviewers in coordinated teams may 
have separate and distinct responsibilities. 

In practice, reviewers often use Ad Hoc or Checklist detection techniques to discharge identical, general
responsibilities. Some authors, notably Parnas and Weiss [13], have argued that inspections would be more
effective if each reviewer used a different set of systematic detection techniques to discharge different, specific
responsibilities.

Until now, however, there have been no reproducible, quantitative studies comparing alternative detection 
methods for software inspections. We have conducted such an experiment and our results demonstrate that the 
choice of defect detection method significantly affects inspection performance. Furthermore, our experimental 
design may be easily replicated by interested researchers. 

Below we describe the relevant literature, several alternative defect detection methods which motivated our
study, our research hypothesis, and our experimental observations, analysis, and conclusions.

^1 We use the word defect instead of the word fault even though this does not adhere to the IEEE Standards on Software Engineering
Terminology [9]. We feel the word fault has a code-specific connotation - only one of the many places where inspections are used.

^2 Depending on the exact form of the inspection, they are sometimes called reviews or walkthroughs. For a more thorough
description of the taxonomy see [8], pp. 171ff, and [10].

1.1 Inspection Literature

A summary of the origins and the current practice of inspections may be found in Humphrey [8]. Consequently, 
we will discuss only work directly related to our current efforts. 

Fagan [6] defined the basic software inspection process. While most writers have endorsed his approach [3,
8], Parnas and Weiss are more critical [13]. In part, they argue that effectiveness suffers because individual
reviewers are not assigned specific responsibilities and because they lack systematic techniques for meeting those
responsibilities.

Some might argue that Checklists are systematic because they help define each reviewer’s responsibilities and
suggest ways to identify defects. Certainly, Checklists often pose questions that help reviewers discover defects.
However, we argue that the generality of these questions and the lack of concrete strategies for answering them
make the approach nonsystematic.

To address these concerns - at least for software designs - Parnas and Weiss introduced the idea of active
design reviews. The principal characteristic of an active design review is that each individual reviewer reads for a
specific purpose, using specialized questionnaires. This proposal forms the motivation for the detection method
proposed in Section 2.2.2.

1.2 Detection Methods

Ad Hoc and Checklist methods are the two most frequently used defect detection methods. With Ad Hoc 
detection methods, all reviewers use nonsystematic techniques and are assigned the same general responsibilities. 

Checklist methods are similar to Ad Hoc, but each reviewer receives a checklist. Checklist items capture
important lessons learned from previous inspections within an environment or application. Individual checklist
items may enumerate characteristic defects, prioritize different defects, or pose questions that help reviewers
discover defects, such as “Are all interfaces clearly defined?” or “If input is received at a faster rate than can
be processed, how is this handled?” The purpose of these items is to focus reviewer responsibilities and suggest
ways for reviewers to identify defects.

Figure 1: Systematic Inspection Research Hypothesis. This figure represents a software requirements
specification before and after a nonsystematic technique, general and identical responsibility inspection and a
systematic technique, specific and distinct responsibility inspection. The points and holes represent various
defects. The line-filled regions indicate the coverage achieved by different members of the inspection team.
Our hypothesis is that systematic technique, specific and coordinated responsibility inspections achieve broader
coverage and minimize reviewer overlap, resulting in higher defect detection rates and greater cost benefits than
nonsystematic methods.


1.3 Hypothesis 

We believe that an alternative approach which gives individual reviewers specific, orthogonal detection
responsibilities and specialized techniques for meeting them will result in more effective inspections.

To explore this alternative we developed a set of defect-specific techniques called Scenarios - collections of
procedures for detecting particular classes of defects. Each reviewer executes a single scenario and multiple
reviewers are coordinated to achieve broad coverage of the document.

Our underlying hypothesis is depicted in Figure 1: that nonsystematic techniques with general reviewer 
responsibility and no reviewer coordination, lead to overlap and gaps, thereby lowering the overall inspection ef- 
fectiveness; while systematic approaches with specific, coordinated responsibilities reduce gaps, thereby increasing 
the overall effectiveness of the inspection. 


2 The Experiment 

To evaluate our systematic inspection hypothesis we designed and conducted a multi-trial experiment. The goals
of this experiment were twofold: to characterize the behavior of existing approaches and to assess the potential
benefits of Scenario-based methods. We ran the experiment twice: once in the Spring of 1993, and once the
following Fall. Both runs used 24 subjects - students taking a graduate course in formal methods who acted
as reviewers. Each complete run consisted of (1) a training phase in which the subjects were taught inspection
methods and the experimental procedures, and in which they inspected a sample SRS, and (2) an experimental
phase in which the subjects conducted two monitored inspections.

2.1 Experimental Design 

The design of the experiment is somewhat unusual. To avoid misinterpreting the data it is important to
understand the experiment and the reasons for certain elements of its design.

2.1.1 Variables 

The experiment manipulates five independent variables: 

1. the detection method used by a reviewer (Ad Hoc, Checklist, or Scenario); 

2. the experimental replication (we conducted two separate replications); 

3. the inspection round (each reviewer participates in two inspections during the experiment); 

4. the specification to be inspected (two are used during the experiment);

5. the order in which the specifications are inspected (either specification can be inspected first). 

The detection method is our treatment variable. The other variables allow us to assess several potential 
threats to the experiment’s internal validity. 

For each inspection we measure four dependent variables:

1. the individual defect detection rate,

2. the team defect detection rate,

3. the percentage of defects first identified at the collection meeting (meeting gain rate), and

4. the percentage of defects first identified by an individual, but never reported at the collection meeting
(meeting loss rate).

^3 See Judd, et al. [11], chapter 4 for an excellent discussion of randomized social experimental designs.

^4 The team and individual defect detection rates are the number of defects detected by a team or individual divided by the total
number of defects known to be in the specification. The closer that value is to 1, the more effective the detection method. No defects
were intentionally seeded into the specifications. All defects are naturally occurring.
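The four dependent variables can be sketched in Python using sets of defect identifiers. This
is only an illustration under stated assumptions: the team rate is taken from the defects
recorded at the collection meeting, and the meeting gain and loss rates use the total number
of known defects as their denominator, which the text does not specify. The function name and
the sample data are hypothetical.

```python
def inspection_measures(known_defects, individual_reports, meeting_report):
    """Compute the four dependent variables for one team inspection.

    known_defects: set of all defects known to be in the specification
    individual_reports: list of sets, one per reviewer (found before the meeting)
    meeting_report: set of defects recorded at the collection meeting
    """
    total = len(known_defects)
    found_individually = set().union(*individual_reports)
    return {
        "individual_rates": [len(r & known_defects) / total for r in individual_reports],
        "team_rate": len(meeting_report & known_defects) / total,
        # Defects that appear at the meeting but in no individual report.
        "meeting_gain_rate": len(meeting_report - found_individually) / total,
        # Defects found by some individual but never recorded at the meeting.
        "meeting_loss_rate": len(found_individually - meeting_report) / total,
    }

# Hypothetical inspection: 10 known defects and a three-person team.
known = set(range(1, 11))
reports = [{1, 2, 3}, {2, 4}, {5}]
meeting = {1, 2, 3, 4, 6}
print(inspection_measures(known, reports, meeting))
```

In this invented example defect 6 is a meeting gain and defect 5 is a meeting loss, so the two
effects cancel out in the team rate.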

Table 1: This table shows the settings of the independent variables. Each team inspects two documents, the
WLMS and CRUISE, one per round, using one of the three detection methods. Teams from the first replication
are denoted 1A-1H, teams from the second replication are denoted 2A-2H.

2.1.2 Design 

The purpose of this experiment is to compare the Ad Hoc, Checklist, and Scenario detection methods for
inspecting software requirements specifications.

When comparing multiple treatments, experimenters frequently use fractional factorial designs. These
systematically explore all combinations of the independent variables, allowing extraneous factors such as
ability, specification quality, and learning to be measured and eliminated from the experimental analysis.

Had we used such a design each team would have participated in three inspection rounds, reviewing each of
three specifications and using each of three methods exactly once. The order in which the methods are applied
and the specifications are inspected would have been dictated by the experimental design.

Such designs are unacceptable for this study because they require some teams to use the Ad Hoc or Checklist
method after they have used the Scenario method. Since the Ad Hoc and Checklist methods are nonsystematic, it
is impossible to define, monitor, and enforce their use. Therefore, we were concerned that the use of the Scenario
method in an early round might imperceptibly distort the use of the other methods in later rounds.

Consequently, we chose a partial factorial design in which not all combinations of the independent variables
are present. With this design, each team participates in two inspections, using some combination of the three
detection methods, but teams using the Scenario method in the first round must continue to use it in the second
round. Table 1 shows the settings of the independent variables.
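The ordering constraint described above can be expressed as a small filter over all pairs of
methods. The Python sketch below encodes only the rule stated in the text (Scenario in the
first round implies Scenario in the second); the actual design may exclude further
combinations, and the names used here are hypothetical.

```python
from itertools import product

METHODS = ("Ad Hoc", "Checklist", "Scenario")

def allowed(round1, round2):
    """A team that used Scenario in round 1 must keep using it in round 2."""
    return round1 != "Scenario" or round2 == "Scenario"

# All (round 1, round 2) method pairs that respect the constraint.
sequences = [(r1, r2) for r1, r2 in product(METHODS, repeat=2) if allowed(r1, r2)]
for r1, r2 in sequences:
    print(f"{r1} -> {r2}")
```

Of the nine possible pairs, the filter removes the two in which a nonsystematic method follows
the Scenario method.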

2.1.3 Threats to Internal Validity 

A potential problem in any experiment is that some factor may affect the dependent variable without the
researcher’s knowledge. This possibility must be minimized. We considered five such threats: (1) selection effects,
(2) maturation effects, (3) replication effects, (4) instrumentation effects, and (5) presentation effects.

[Table 1 body, only partially legible in the source. Columns: Round 1 (WLMS, CRUISE) and Round 2 (WLMS,
CRUISE); rows: detection method. Legible cells: 1A; Checklist: 2B, then 1E, 2D, 2G, then 1B, 1H; 2H.]

Selection effects are due to natural variation in human performance. For example, random assignment of
subjects may accidentally create an elite team. Therefore, the difference in this team’s natural ability will mask
differences in the detection method performance. Two approaches are often taken to limit this effect:

1. Create teams with equal skills. For example, rate each participant’s background knowledge and experience 
as either low, medium, or high and then form teams of three by selecting one individual at random from 
each experience category. Detection methods are then assigned to fit the needs of the experiment. 

2. Compose teams randomly, but require each team to use all three methods. In this way, differences in team
skill are spread across all treatments.
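The first approach above, forming teams of three with one member from each experience
category, amounts to a stratified random assignment. A minimal Python sketch follows; it
assumes the three categories contain the same number of participants, and the function name
and participant labels are hypothetical.

```python
import random

def form_balanced_teams(participants, seed=None):
    """participants: dict mapping name -> 'low' | 'medium' | 'high'."""
    rng = random.Random(seed)
    strata = {"low": [], "medium": [], "high": []}
    for name, level in participants.items():
        strata[level].append(name)
    for members in strata.values():
        rng.shuffle(members)
    # zip pairs the i-th shuffled member of each stratum into one team.
    return list(zip(strata["low"], strata["medium"], strata["high"]))

# Hypothetical participants and experience ratings.
people = {"a": "low", "b": "low", "c": "medium",
          "d": "medium", "e": "high", "f": "high"}
print(form_balanced_teams(people, seed=1))
```

If the categories were of unequal size, zip would silently drop the surplus participants, so a
real assignment procedure would need an explicit rule for that case.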

Neither approach is entirely appropriate. Although we used the first approach in our initial replication, the
approach is unacceptable for multiple replications, because even if teams within a given replication have equal
skills, teams from different replications will not.

As discussed in the previous section, the second approach is also unsuitable because using the Scenarios in
the first inspection round will certainly bias the application of the Ad Hoc or Checklist methods in the second
inspection round.

Our strategy for the second replication and future replications is to randomly assign teams and detection 
methods. However, teams that used Scenarios in the first round were constrained to use them again in the 
second round. This compromise efficiently employs the subjects without biasing the performance of any teams. 

Maturation effects are due to subjects learning as the experiment proceeds. We have manipulated the detection
method used and the order in which the documents are inspected so that the presence of this effect can be
discovered and taken into account.

Replication effects are caused by differences in the materials, participants, or execution of multiple
replications. We limit this effect by using only first and second year graduate students as subjects - rather than
both undergraduate and graduate students. Also, we maintain consistency in our experimental procedures by
packaging the experimental procedures as a classroom laboratory exercise. This helps us to ensure that similar
steps are followed for all replications.

As will be shown in Section 3, variation in the defect detection rate is not explained by selection, maturation,
or replication effects.

Finally, instrumentation effects may result from differences in the specification documents. Such variation is
impossible to avoid, but we controlled for it by having each team inspect both documents.

2.1.4 Threats to External Validity 

Threats to external validity limit our ability to generalize the results of our experiment to industrial practice.
We identified three such threats:

1. the reviewers in the first run of our experiment may not be representative of software programming profes- 
sionals; 

2. the specification documents may not be representative of real programming problems; 

3. the inspection process in our experimental design may not be representative of software development prac- 
tice. 

The first two threats are real. To surmount them we are currently replicating our experiment using software
programming professionals to inspect industrial work products. Nevertheless, laboratory experimentation is a
necessary first step because it greatly reduces the risk of transferring immature technology.

We avoided the third threat by modeling the experiment’s inspection process after the design inspection 
process described in Eick, et al. [5], which is used by several development organizations at AT&T; therefore, we 
know that at least one professional software development organization practices inspections in this manner. 

2.1.5 Analysis Strategy 

Our analysis strategy had two steps. The first step was to find those independent variables that individually 
explain a significant amount of the variation in the team detection rate. This was done by an analysis of 
variance technique as discussed in Box, et al. ([4], pp. 165ff). 

The second step was to evaluate the combined effect of the variables shown to be significant in the initial 
analysis. Again, we followed Box, et al. closely ([4], pp. 210ff). 

Once these relationships were discovered and their magnitude estimated, we examined other data, such as 
correlations between the categories of defects detected and the detection methods used, that would confirm or 
reject (if possible) a causal relationship between detection methods and inspection performance. 

2.2 Experiment Instrumentation 

We developed several instruments for this experiment: three small software requirements specifications (SRS), 
instructions and aids for each detection method, and a data collection form. 

2.2.1 Software Requirements Specifications 

The SRS we used describe three event-driven process control systems: an elevator control system, a water level 
monitoring system, and an automobile cruise control system. Each specification has four sections: Overview, Spe- 
cific Functional Requirements, External Interfaces, and a Glossary. The overview is written in natural language, 
while the other three sections are specified using the SCR tabular requirements notation [7]. 

For this experiment, all three documents were adapted to adhere to the IEEE suggested format [10]. All 
defects present in these SRS appear in the original documents or were generated during the adaptation process; 
no defects were intentionally seeded into the documents. The authors discovered 42 defects in the WLMS SRS 
and 26 in the CRUISE SRS. The authors did not inspect the ELEVATOR SRS since it was only used for training 
exercises. 

Elevator Control System (ELEVATOR) [18] describes the functional and performance requirements of a 
system for monitoring the operation of a bank of elevators (16 pages). 

Water Level Monitoring System (WLMS) [16] describes the functional and performance requirements of 
a system for monitoring the operation of a steam generating system (24 pages). 

Automobile Cruise Control System (CRUISE) [12] describes the functional and performance requirements 
for an automobile cruise control system (31 pages). 

2.2.2 Defect Detection Methods 

To make a fair assessment of the three detection methods (Ad Hoc, Checklist, and Scenario), each method should 
search for a well-defined population of defects. To accomplish this, we used a general defect taxonomy to define 
the responsibilities of Ad Hoc reviewers. 

Figure 2: Relationship Between Defect Detection Methods. The figure depicts the relationship between 
the defect detection methods used in this study. The vertical extent represents the coverage. The horizontal 
axis labels the method and represents the degree of detail (the greater the horizontal extent, the greater the detail). 
Moving from Ad Hoc to Checklist to Scenario there is more detail and less coverage. The gaps in the Scenario 
and Checklist columns indicate that the Checklist is a subset of the Ad Hoc and the Scenarios are a subset of 
the Checklist. 


The checklist used in this study is a refinement of the taxonomy. Consequently, Checklist responsibilities are 
a subset of the Ad Hoc responsibilities. 

The Scenarios are derived from the checklist by replacing individual Checklist items with procedures de- 
signed to implement them. As a result, Scenario responsibilities are distinct subsets of Checklist and Ad Hoc 
responsibilities. The relationship between the three methods is depicted in Figure 2. 

The taxonomy is a composite of two schemes developed by Schneider, et al. [14] and Basili and Weiss [2]. 
Defects are divided into two broad types: omission, in which important information is left unstated, and commission, 
in which incorrect, redundant, or ambiguous information is put into the SRS by the author. Omission defects 
were further subdivided into four categories: Missing Functionality, Missing Performance, Missing Environment, 
and Missing Interface. Commission defects were also divided into four categories: Ambiguous Information, 
Inconsistent Information, Incorrect or Extra Functionality, and Wrong Section. (See Appendix A for complete 
taxonomy.) We provided a copy of the taxonomy to each reviewer. 

Ad Hoc reviewers received no further assistance. 

Checklist reviewers received a single checklist derived from the defect taxonomy. To generate the checklist we 
populated the defect taxonomy with detailed questions culled from several industrial checklists. Thus, they are 
very similar to checklists used in practice. All Checklist reviewers used the same checklist. (See Appendix B for 
the complete checklist.) 

Figure 3: Reviewer Defect Report Form. This is a small sample of the defect report form completed during 
each reviewer’s defect detection. Defects number 10 and 11, found by reviewer 12 of team C for the WLMS 
specification are shown. 


Finally, we developed three groups of Scenarios. Each group of Scenarios was designed for a specific subset 
of the Checklist items: 

1. Data Type Inconsistencies (DT), 

2. Incorrect Functionalities (IF), 

3. Missing or Ambiguous Functionalities (MF). 

After the experiment was finished we applied the Scenarios to estimate how broadly they covered the WLMS 
and CRUISE defects. We estimated that the Scenarios address about half of the defects that are covered by the 
Checklist. Appendix C contains the complete list of Scenarios. 


2.2.3 Defect Report Forms 

We also developed a Defect Report Form. Whenever a potential defect was discovered - during either the 
defect detection or the collection activities - an entry was made on the form. The entry included four kinds 
of information: Inspection Activity (Detection or Collection); Defect Location (page and line numbers); Defect 
Disposition (True Defect or False Positive); and a prose Defect Description. 
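
As an illustration only, one entry on the Defect Report Form could be modeled as a small record with the four kinds of information listed above. The sketch is in Python, which the study itself does not use; the class and field names (`DefectReportEntry`, `Activity`, `Disposition`) and the sample values are assumptions made for this example and are not part of the experimental materials.

```python
from dataclasses import dataclass
from enum import Enum


class Activity(Enum):
    """Inspection activity during which the potential defect was recorded."""
    DETECTION = "Detection"
    COLLECTION = "Collection"


class Disposition(Enum):
    """Outcome assigned to a reported potential defect."""
    TRUE_DEFECT = "True Defect"
    FALSE_POSITIVE = "False Positive"


@dataclass(frozen=True)
class DefectReportEntry:
    """One row of the (hypothetically modeled) Defect Report Form."""
    activity: Activity
    page: int
    line: int
    disposition: Disposition
    description: str


# Hypothetical entry; the location and description are invented for illustration.
entry = DefectReportEntry(
    activity=Activity.DETECTION,
    page=7,
    line=12,
    disposition=Disposition.TRUE_DEFECT,
    description="Sensor range is not specified.",
)
```

Using enumerations for the activity and the disposition keeps entries restricted to the values the form allows.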

A small sample of a Defect Report appears in Figure 3. 

2.3 Experiment Preparation 

The participants were given a series of lectures on software requirements specifications, the SCR tabular 
requirements notation, inspection procedures, the defect classification scheme, and the filling out of data collection 
forms. The references for these lectures were Fagan [6], Parnas [13], and the IEEE Guide to Software 
Requirements Specifications [1]. 

The participants were then assembled into three-person teams - see Section 2.1.3 for details. Within each 
team, members were randomly assigned to act as the moderator, the recorder, or the reader during the collection 
meeting. 

2.4 Conducting the Experiment 

2.4.1 Training 

For the training exercise, each team inspected the ELEVATOR SRS. Individual team members read the 
specification and recorded all defects they found on a Defect Report Form. Their efforts were restricted to two hours. 
Later we met with the participants and answered questions about the experimental procedures. Afterwards, each 
team conducted a supervised collection meeting and filled out a master Defect Report Form for the entire team. 
The ELEVATOR SRS was not used in the remainder of the experiment. 

2.4.2 Experimental Phase 

This phase involved two inspection rounds. The instruments used were the WLMS and CRUISE specifications 
discussed in Section 2.2.1, a checklist, three groups of defect-based scenarios, and the Defect Report Form. The 
development of the checklist and scenarios is described in Section 2.2.2. The same checklist and scenarios were 
used for both documents. 

During the first Round, four of the eight teams were asked to inspect the CRUISE specification; the remaining 
four teams inspected the WLMS specification. The detection methods used by each team are shown in Table 1. 
Defect detection was limited to two hours, and all potential defects were reported on the Defect Report Form. 
After defect detection, all materials were collected.* 

* For each round, we set aside 14 two-hour time slots during which inspection tasks could be done. Participants performed each 
task within a single two-hour session and were not allowed to work at other times. 

Figure 4: Data Collection for WLMS inspections. This figure shows the data collected from one 
team’s WLMS inspection. The first three rows identify the review team members, the detection methods they 
used, and the number of defects they found, and show their individual defect summaries. The fourth row contains 
the team defect summary. The defect summaries show a 1 (0) where the team or individual found (did not find) a 
defect. The fifth row contains the defect key which identifies those reviewers who were responsible for the defect 
(AH for Ad Hoc only; CH for Checklist or Ad Hoc; DT for data type inconsistencies, Checklist, and Ad Hoc; IF 
for incorrect functionality, Checklist and Ad Hoc; and MA for missing or ambiguous functionality, Checklist and 
Ad Hoc). Meeting gain and loss rates can be calculated by comparing the individual and team defect summaries. 
For instance, defect 21 is an example of meeting loss. It was found by reviewer 44 during the defect detection 
activity, but the team did not report it at the collection meeting. Defect 32 is an example of meeting gain; it is 
first discovered at the collection meeting. 


Once all team members had finished defect detection, the team’s moderator arranged for the collection 
meeting. At the collection meeting, the documents were reread and defects discussed. The team’s recorder 
maintained the team’s master Defect Report Form. Collection was also limited to 2 hours. The entire Round 
was completed in one week. 

The second Round was similar to the first except that teams who had inspected the WLMS during Round 1 
inspected the CRUISE in Round 2 and vice versa. 
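
The two-round arrangement described above is a crossover: each team inspects one specification in Round 1 and the other in Round 2. A minimal sketch of such a schedule follows. The team labels and the rule that the first half of the list starts with CRUISE are assumptions for illustration; the actual assignment used in the experiment is the one given in Table 1.

```python
def crossover_schedule(teams):
    """Map each team to (Round 1 specification, Round 2 specification).

    The first half of `teams` starts with CRUISE and the second half with
    WLMS; every team then switches to the other specification in Round 2.
    """
    half = len(teams) // 2
    schedule = {}
    for index, team in enumerate(teams):
        first = "CRUISE" if index < half else "WLMS"
        second = "WLMS" if first == "CRUISE" else "CRUISE"
        schedule[team] = (first, second)
    return schedule


teams = [f"T{i}" for i in range(1, 9)]  # hypothetical labels for eight teams
schedule = crossover_schedule(teams)
```

Because every team sees both documents, differences between the specifications can be separated from differences between teams.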


3 Data and Analysis 

3.1 Data 

Three sets of data are important to our study: the defect key, the team defect summaries, and the individual 
defect summaries. 

The defect key encodes which reviewers are responsible for each defect. In this study, reviewer responsibilities 
are defined by the detection techniques a reviewer uses. Ad Hoc reviewers are responsible for (asked to search for) 
all defects. Checklist reviewers are responsible for a large subset of the Ad Hoc defects (i.e., defects for which an 
Ad Hoc reviewer is responsible). Since each Scenario is a refinement of several Checklist items, each Scenario 
reviewer is responsible for a distinct subset of the Checklist defects. 

Figure 5: Individual and Team Defect Summaries (CRUISE). This figure shows the data collected from 
one team's CRUISE inspection. The data is identical to that of the WLMS inspections except that the CRUISE 
has fewer defects - 26 versus 42 for the WLMS - and the defect key is different. 

The team defect summary shows whether or not a team discovered a particular defect. This data is gathered 
from the defect report forms filled out at the collection meetings and is used to assess the effectiveness of each 
defect detection method. 

The individual defect summary shows whether or not a reviewer discovered a particular defect. This data is 
gathered from the defect report forms each reviewer completed during their defect detection activity. Together 
with the defect key it is used to assess whether or not each detection technique improves the reviewer's ability 
to identify specific classes of defects. 
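
To make the role of the defect key concrete, the sketch below splits one reviewer's individual defect summary into defects found inside and outside that reviewer's responsibility. The key codes follow the tables in this paper (DT, MF, IF, CH, AH); the responsibility map restates the subset relationship described in Section 2.2.2, and the six-defect key and summary are hypothetical.

```python
# Defect-key codes that each detection method is responsible for. Scenario
# classes are subsets of the Checklist, which is a subset of Ad Hoc.
RESPONSIBLE_FOR = {
    "DT": {"DT"},
    "MF": {"MF"},
    "IF": {"IF"},
    "CH": {"DT", "MF", "IF", "CH"},
    "AH": {"DT", "MF", "IF", "CH", "AH"},
}


def split_counts(summary, defect_key, method):
    """Return (found within responsibility, found outside responsibility)."""
    targets = RESPONSIBLE_FOR[method]
    inside = sum(found for found, code in zip(summary, defect_key) if code in targets)
    outside = sum(found for found, code in zip(summary, defect_key) if code not in targets)
    return inside, outside


defect_key = ["DT", "CH", "MF", "AH", "IF", "DT"]  # hypothetical key for six defects
summary = [1, 0, 0, 1, 0, 1]                       # hypothetical reviewer, 1 = found

dt_counts = split_counts(summary, defect_key, "DT")
ch_counts = split_counts(summary, defect_key, "CH")
```

The first count feeds comparisons of the H1 kind (defects the method targets), the second those of the H2 kind (defects it does not target).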

We measure the value of collection meetings by comparing the team and individual defect summaries to 
determine the meeting gain and loss rates. 

One team's individual and team defect summaries and the defect key are represented in Figures 4 and 5. 

3.2 Analysis of Team Performance 

Figure 6 summarizes the team performance data. As depicted, the Scenario detection method resulted in the 
highest defect detection rates, followed by the Ad Hoc detection method, and finally by the Checklist detection 
method. 

Table 2 presents a statistical analysis of the team performance data as outlined in Section 2.1.5. The inde- 
pendent variables are listed from the most to the least significant. The Detection method and Specification are 
significant, but the Round, Replication, and Order are not. 

Figure 6: Defect Detection Rates by Independent Variable. The dashes in the far left column show each 
team’s defect detection rate for the WLMS and CRUISE. The horizontal line is the average defect detection rate. 
The plot demonstrates the ability of each variable to explain variation in the defect detection rates. For the 
Specification variable, the vertical location of WLMS (CRUISE) is determined by averaging the defect detection 
rates for all teams inspecting WLMS (CRUISE). The vertical bracket, ], to the right of each variable shows one 
standard error of the difference between two settings of the variable. The plot indicates that both the Method 
and Specification are significant, but Round, Replication, and Order are not. 


Independent Variable              SS_T          ν_T           SS_R          ν_R           (SS_T/ν_T)(ν_R/SS_R)   Significance Level 
Detection Method - treatment      .200          2             .359          29            8.064                  < .01 
Specification - instrumentation   .163          1             .396          30            12.338                 < .01 
Inspection round - maturation     .007          1             .551          [illegible]   .391                   .54 
Experimental run - replication    .007          [illegible]   .551          30            .391                   .54 
Order - presentation              .003          1             .003          [illegible]   .141                   .71 
Team composition - selection      [illegible]   [illegible]   [illegible]   [illegible]   1.151                  .39 

Table 2: Analysis of Variance for Each Independent Variable. The analysis of variance shows that only 
the choice of detection method and specification significantly explain variation in the defect detection rate. Team 
composition is also not significant. 


Next, we analyzed the combined Instrumentation and Treatment effects. Table 3 shows the input to this 
analysis. Six of the cells contain the average detection rate for teams using each detection method and specification 
(3 detection methods applied to 2 specifications). The results of this analysis, shown in Table 4, indicate that the 
interaction between Specification and Method is not significant. This means that although the average detection 
rates varied for the two specifications, the effect of the detection methods is not linked to these differences. 
Therefore, we reject the null hypothesis that the detection methods have no effect on inspection performance. 
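
The test statistic reported in Tables 2 and 4 is the ratio of mean squares, (SS_T/ν_T)(ν_R/SS_R). As a quick check, the sketch below recomputes it from the Table 2 entries for Detection Method and Specification. The function name `f_ratio` is an assumption for this example. The results come out near 8.08 and 12.35, close to the tabulated 8.064 and 12.338; the small differences are presumably due to rounding of the published sums of squares.

```python
def f_ratio(ss_t, nu_t, ss_r, nu_r):
    """Ratio of the treatment mean square to the residual mean square."""
    return (ss_t / nu_t) * (nu_r / ss_r)


# Values as printed in Table 2.
f_method = f_ratio(ss_t=0.200, nu_t=2, ss_r=0.359, nu_r=29)
f_spec = f_ratio(ss_t=0.163, nu_t=1, ss_r=0.396, nu_r=30)
```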
            Detection Method 
            Ad Hoc                        Checklist         Scenario 
WLMS        .5 .38 .29 .5 .48 .45         .29 .52 .5 .33    [illegible] 
(average)   .43                           .41               [illegible] 
CRUISE      .46 .27 .27 .23 .38 .23 .35   .19 .31 .23 .23   .5 .42 .42 .54 .35 
(average)   .31                           .24               .45 


Table 3: Team Defect Detection Rate Data. The nominal and average defect detection rates for all 16 
teams. 
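
The cell averages in Table 3 can be recomputed from the individual team rates. The sketch below does this for the five cells whose rates are legible in this copy, assuming the three columns are Ad Hoc, Checklist, and Scenario in that order; the WLMS Scenario cell is left out because its rates are not legible here. The dictionary layout is an assumption for illustration.

```python
from statistics import mean

# Team detection rates as printed in Table 3, keyed by (specification, method).
rates = {
    ("WLMS", "Ad Hoc"): [.5, .38, .29, .5, .48, .45],
    ("WLMS", "Checklist"): [.29, .52, .5, .33],
    ("CRUISE", "Ad Hoc"): [.46, .27, .27, .23, .38, .23, .35],
    ("CRUISE", "Checklist"): [.19, .31, .23, .23],
    ("CRUISE", "Scenario"): [.5, .42, .42, .54, .35],
}

# Average rate per cell, rounded to two decimals as in the table.
cell_means = {cell: round(mean(values), 2) for cell, values in rates.items()}
```

The rounded means agree with the averages printed in the table (.43 and .41 for WLMS; .31, .24, and .45 for CRUISE).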


Effect                     SS_T          ν_T           SS_R          ν_R           (SS_T/ν_T)(ν_R/SS_R)   Significance Level 
Detection Method           [illegible]   2             [illegible]   [illegible]   12.235                 < .01 
Specification              [illegible]   [illegible]   [illegible]   [illegible]   17.556                 < .01 
Method x Specification     [illegible]   2             [illegible]   [illegible]   .217                   [illegible] 


Table 4: Analysis of Variance of Detection Method and Specification. This table displays the results of 
an analysis of the variance of the average detection rates given in Table 3. 


3.3 Effect of Scenarios on Individual Performance 

We initially hypothesized that increasing the specialization and coordination of each reviewer’s responsibilities 
would improve team performance. We proposed that the Scenario would be one way to achieve this. We have 
shown above that the teams using Scenarios were the most effective. However, this did not establish that the 
improvement was due to increases in specialization and coordination, and not to some other factor. 

Consequently, our concern is to determine exactly how the use of Scenarios affected the reviewer’s performance. 
To examine this, we formulated two hypothesis schemas. 

• H1: Method X reviewers do not find any more X defects than do method Y reviewers. 

• H2: Method X reviewers find either a greater or smaller number of non X defects than do 
method Y reviewers. 

Alternative explanations for the observed improvement could be (1) the Scenario reviewers responded to some 
perceived expectation that their performance should improve; or (2) the Scenario approach improves individual 
performance regardless of Scenario content. 


3.3.1 Rejecting the Perceived Expectation Argument 

If Scenario reviewers performed better than Checklist and Ad Hoc reviewers on both scenario-targeted and 
non-scenario-targeted defects, then we must consider the possibility that their improvement was caused by something 
other than the scenarios themselves. 

Table 5: Significance Table for H1 hypotheses: WLMS inspections. This table tests the H1 hypothesis 
- Method X reviewers do not find any more X defects than do method Y reviewers - for all pairs of detection 
methods. Each row in the table corresponds to a population of reviewers and the population of defects for which 
they were responsible, i.e., method X reviewers and X defects. The last five columns correspond to a second 
reviewer population, i.e., method Y reviewers. Each cell in the last five columns contains two values. The first 
value is the probability that H1 is true, using the one-sided Wilcoxon-Mann-Whitney test. The second value - 
in parentheses - is the median number of defects found by the method Y reviewers. 


Table 6: Significance Table for H1 hypotheses: CRUISE inspections. This analysis is identical to that 
performed for WLMS inspections. However, we chose not to perform any statistical analysis for the Missing 
Functionality and Incorrect Functionality defects because there are too few defects of those types. 


Reviewers Using Method   Finding Defects of Type   Compared with Reviewers Using Method 
Detection   Number       Defect       Number 
Method      Reviewers    Population   Present      DT            MF            IF            CH            AH 
DT          6            non-DT       28           [illegible]   [illegible]   [illegible]   [illegible]   [illegible] 
MF          6            non-MF       37           .87 (11)      (9.5)         .83 (12.5)    [illegible]   .64 (10) 
IF          6            non-IF       37           [illegible]   [illegible]   [illegible]   [illegible]   [illegible] 
CH          12           non-CH       4            [illegible]   [illegible]   .35 (1)       - (1)         [illegible] 
AH          18           non-AH       0            NA (0)        NA (0)        NA (0)        NA (0)        (0) 


Table 7: Significance Table for H2 hypothesis: WLMS inspections. This table tests the H2 hypothesis 
- Method X reviewers find a greater or smaller number of non X defects than do method Y reviewers - for all 
pairs of detection methods. Each row in the table corresponds to a population of reviewers and the population of 
defects for which they were not responsible - i.e., method X reviewers and non X defects (the complement of the 
set of X defects). The last five columns correspond to a second reviewer population, i.e., method Y reviewers. 
Each cell in the last five columns contains two values. The first value is the probability that H2 is true, using the 
two-sided Wilcoxon-Mann-Whitney test. The second value is the median number of defects found by the method 
Y reviewers. 


One possibility was that the Scenario reviewers were merely reacting to the novelty of using a clearly different 
approach, or to a perceived expectation on our part that their performance should improve. To examine this 
we analyzed the individual defect summaries to see how Scenario reviewers differed from other reviewers. 

The detection rates of Scenario reviewers (i.e., reviewers using Scenarios) are compared with those of all other 
reviewers in Tables 5, 6, 7 and 8. Using the one and two-sided Wilcoxon-Mann-Whitney tests [15], we found that in 
most cases Scenario reviewers were more effective than Checklist or Ad Hoc reviewers at finding the defects the 
scenario was designed to uncover. At the same time, all reviewers, regardless of which detection method each used, 
were equally effective at finding those defects not targeted by any of the Scenarios. 

Since Scenario reviewers could not have known the defect classifications, it is unlikely that their reporting could 
have been biased. Therefore these results suggest that the detection rate of Scenario reviewers shows improvement 
only with regard to those defects for which they are explicitly responsible. Consequently, the argument that the 
Scenario reviewers’ improved performance was primarily due to raised expectations or unknown motivational 
factors is not supported by the data. 
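
To illustrate the kind of comparison behind the H1 probabilities, the sketch below computes an exact one-sided Wilcoxon-Mann-Whitney (rank-sum) p-value by enumerating every way of relabelling the pooled observations. This is a generic textbook formulation rather than the authors' own analysis procedure; the function names and the two samples of defect counts are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def one_sided_p(x, y):
    """Exact probability that the rank sum of x is at least as large as
    observed when group labels are assigned at random."""
    ranks = midranks(list(x) + list(y))
    n = len(x)
    observed = sum(ranks[:n])
    hits = total = 0
    for chosen in combinations(range(len(ranks)), n):
        total += 1
        if sum(ranks[i] for i in chosen) >= observed - 1e-9:
            hits += 1
    return hits / total


# Hypothetical counts of targeted defects found by two groups of reviewers.
scenario_counts = [6, 7, 5, 8, 6]
ad_hoc_counts = [3, 4, 5, 2, 4, 3]
p_value = one_sided_p(scenario_counts, ad_hoc_counts)
```

A small p-value argues against H1, i.e., against the claim that method X reviewers find no more X defects than method Y reviewers. Full enumeration is only practical for small groups such as the five or six reviewers per Scenario in this study.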
Reviewers Using Method   Finding Defects of Type   Compared with Reviewers Using Method 
Detection   Number       Defect       Number 
Method      Reviewers    Population   Present      DT            MF            IF            CH            AH 
DT          5            non-DT       16           (2)           [illegible]   [illegible]   [illegible]   .46 (2) 
MF          5            non-MF       25           .96 (8)       (5)           .33 (4)       .06 (3)       .62 (5) 
IF          5            non-IF       23           [illegible]   .41 (4)       (5)           .44 (2.5)     .57 (5) 
CH          12           non-CH       2            NA (0)        NA (1)        NA (0)        (0)           NA (0) 
AH          21           non-AH       0            NA (0)        NA (0)        NA (0)        NA (0)        (0) 


Table 8: Significance Table for H2 hypothesis: CRUISE inspections. This analysis is identical to that 
performed for WLMS inspections. However, we chose not to perform statistical analysis for the non Checklist 
defects because there are too few defects of that type. 


3.3.2 Rejecting the General Improvement Argument 

Another possibility is that the Scenario approach rather than the content of the Scenarios was responsible for 
the improvement. 

Each Scenario targets a specific set of defects. If the reviewers using a type X Scenario had been no more 
effective at finding type X defects than had reviewers using non-X Scenarios, then the content of the Scenarios 
did not significantly influence reviewer performance. If the reviewers using a type X Scenario had been more 
effective at finding non-X defects than had reviewers using other Scenarios, then some factor beyond content 
caused the improvement. 

To explore these possibilities we compared the Scenario reviewers’ individual defect summaries with each 
other. 

Looking again at Tables 5, 6, 7, and 8 we see that each group of Scenario reviewers were the most effective 
at finding the defects their scenarios were designed to detect, but were generally no more effective than other 
Scenario reviewers at finding defects their Scenarios were not designed to detect. 

Since Scenario reviewers showed improvement only in finding the defects for which they were explicitly re- 
sponsible, we conclude that the content of the Scenario was primarily responsible for the improved reviewer 
performance. 
Figure 7: Meeting Gains for WLMS Inspections. Each point represents the meeting gain rate for a single 
inspection, i.e., the number of defects first identified at a collection meeting divided by the total number of defects 
in the specification. Each rate is marked with a symbol indicating the inspection method used. The vertical line 
segment through each symbol indicates one standard deviation in the estimate (assuming each defect was a 
Bernoulli trial). This information helps in assessing the significance of any one rate. The average meeting gain 
rate is 4.7 ± 1.3% for the WLMS (3.1 ± 1.1% for the CRUISE). 

3.4 Analysis of Checklists on Individual Performance 

The scenarios used in this study were derived from the checklist. Although this checklist targeted a large number 
of existing defects, our analysis shows that Checklist teams were no more effective than Ad 
Hoc teams. One explanation for this is that nonsystematic techniques are difficult for reviewers to implement. 

To study this explanation we again tested the H1 hypothesis that Checklist reviewers were no more effective 
than Ad Hoc reviewers at finding Checklist defects. 

From Tables 5 and 6 we see that even though the Checklist targets a large number of defects, it does not 
actually improve a reviewer’s ability to find those defects. 

3.5 Analysis of Collection Meetings 

In his original paper on software inspections Fagan [6] asserts that 

Sometimes flagrant errors are found during . . . [defect detection], but in general, the number of errors 
found is not nearly as high as in the . . . [collection meeting] operation. 
From a study of over 50 inspections, Votta [17] collected data that strongly contradicts this assertion. In this 
Section, we measure the benefits of collection meetings by comparing the team and individual defect summaries 
to determine the meeting gain and meeting loss rates. (See Figures 4 and 5.) 

A "meeting gain" occurs when a defect is found for the first time at the collection meeting. A "meeting loss" 
occurs when a defect is first found during an individual’s defect detection activity, but it is subsequently not 
recorded during the collection meeting. Meeting gains may thus be offset by meeting losses, and the difference 
between meeting gains and meeting losses is the net improvement due to collection meetings. 
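
These definitions translate directly into a comparison of the individual and team defect summaries. The sketch below computes the gain, loss, and net improvement rates for one inspection, dividing by the total number of defects in the specification as in Figures 7 and 8. The function name and the five-defect summaries are hypothetical.

```python
def meeting_rates(individual_summaries, team_summary):
    """Return (gain rate, loss rate, net improvement rate) for one inspection.

    Each summary is a list of 0/1 flags, one per known defect; the team
    summary reflects what was recorded at the collection meeting.
    """
    total = len(team_summary)
    # A defect counts as found before the meeting if any reviewer found it.
    found_before = [any(flags) for flags in zip(*individual_summaries)]
    gains = sum(1 for before, team in zip(found_before, team_summary) if team and not before)
    losses = sum(1 for before, team in zip(found_before, team_summary) if before and not team)
    return gains / total, losses / total, (gains - losses) / total


# Hypothetical three-person team and five known defects.
individuals = [
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
team = [1, 1, 0, 0, 1]  # the second defect is a gain, the third a loss
gain_rate, loss_rate, net_rate = meeting_rates(individuals, team)
```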

Our results indicate that collection meetings produce no net improvement. 

3.5.1 Meeting Gains 

The meeting gain rates reported by Votta were a negligible 3.9 ± .7%. Our data tells a similar story. (Figure 7 
displays the meeting gain rates for WLMS inspections.) The mean gain rate is 4.7 ± 1.3% for WLMS inspections 
and 3.1 ± 1.1% for CRUISE inspections. The rates are not significantly different. 
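
The error bars in Figures 7 and 8 treat each defect as a Bernoulli trial, so the standard deviation of an estimated rate p over n defects is sqrt(p(1 - p)/n). The sketch below evaluates this for a hypothetical inspection with 2 meeting gains among the 42 WLMS defects; the count of 2 is an assumption chosen only to show the calculation.

```python
import math


def rate_with_std(count, total):
    """Return the observed rate and its standard deviation, assuming each of
    the `total` defects is an independent Bernoulli trial."""
    p = count / total
    return p, math.sqrt(p * (1 - p) / total)


# Hypothetical: 2 defects first found at the meeting, out of 42 in the WLMS.
gain_rate_example, gain_std_example = rate_with_std(2, 42)
```

For these inputs the rate is about 4.8% with a standard deviation of about 3.3%, which shows why a single inspection's rate carries little weight on its own.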

It is interesting to note that these results are consistent with Votta’s earlier study even though Votta’s 
reviewers were professional software developers and not students. 

3.5.2 Meeting Losses 

The average meeting loss rates were 6.8 ± 1.6% and 7.7 ± 1.7% for the WLMS and CRUISE respectively. (See 
Figure 8.) 

One cause of meeting loss might be that reviewers are talked out of the belief that something is a defect. 
Another cause may be that during the meeting reviewers forget or can not reconstruct a defect found earlier. 

This effect has not been previously reported in the literature. However, since the interval between the detection 
and collection activities is usually longer in practice than it was in our experiment (one to two days in our study 
versus one or two weeks in practice), this effect may be quite significant. 

3.5.3 Net Meeting Improvement 

The average net meeting improvement is -.9 ± 2.2 for WLMS inspections and -1.2 ± 1.7 for CRUISE inspections. 
(Figure 9 displays the net meeting improvement for WLMS inspections.) We found no correlations between the 
loss, gain, or net improvement rates and any of our experiment’s independent variables. 

Figure 8: Meeting Loss Rate for WLMS Inspections. Each point represents the meeting loss rate for a 
single inspection. The meeting loss rate is the number of defects first detected by an individual reviewer but not 
recorded at the collection meeting, divided by the total number of defects in the specification. Each rate is marked 
with a symbol indicating the inspection method used. The vertical line segment through each symbol indicates one 
standard deviation in the estimate of the rate (assuming each fault was a Bernoulli trial). This information helps 
in determining the significance of any one rate. The average team loss rate is 6.8 ± 1.6% for the WLMS 
(7.7 ± 1.7% for CRUISE). 


4 Summary and Conclusions 

Our experimental design for comparing defect detection methods is flexible and economical, and allows the 
experimenter to assess several potential threats to the experiment’s internal validity. In particular, we determined 
that neither maturation, replication, selection, nor presentation effects had any significant influence on inspection 
performance. However, differences in the SRS did. 

From our analysis of the experimental data we drew several conclusions. 

1. The defect detection rate when using Scenarios is superior to that obtained with Ad Hoc or 
Checklist methods - an improvement of roughly 35%. 

2. Scenarios help reviewers focus on specific defect classes. Furthermore, in comparison to Ad Hoc 
or Checklist methods, their ability to detect other classes of defects is not compromised. (It should be 
noted however, that the scenarios appeared to be better suited to the defect profile of the WLMS than the 
CRUISE. This indicates that poorly designed scenarios may lead to poor inspection performance.) 

3. The Checklist method, the industry standard, was no more effective than the Ad Hoc 
detection method. 

4. On the average, collection meetings contribute nothing to defect detection effectiveness. 

Figure 9: Net Meeting Improvement for WLMS. Each symbol indicates the net meeting improvement for 
a single inspection. The average net meeting improvement rate is -.9 ± 2.2 for the WLMS (-1.2 ± 1.7 for the 
CRUISE). These rates are not significantly different from 0. 

The results of this work have important implications for software practitioners. The indications are that 
overall inspection performance can be improved when individual reviewers use systematic procedures to address 
a small set of specific issues. This contrasts with the usual practice, in which reviewers have neither systematic 
procedures nor clearly defined responsibilities. 

Economical experimental designs are necessary to allow replication in other environments with different 
populations. For software researchers, this work demonstrates the feasibility of constructing and executing 
inexpensive experiments to validate fundamental research recommendations. 

5 Future Work 

The experimental data raise many interesting questions for future study. 

• In many instances a single reviewer found a defect, but the defect was not subsequently recorded at the 
collection meeting. Are single reviewers sometimes forgetting to mention defects they observed, or is 
the reviewer being talked out of the defect at the team meeting? What are the significant suppression 
mechanisms affecting collection meetings? 

• Very few defects are initially discovered during collection meetings. Therefore, in view of their impact on 
production interval, are these meetings worth holding? 

• More than half of the defects are not addressed by the Scenarios used in this study. What other Scenarios 
are necessary to achieve a broader defect coverage? 

• There are several threats to this experiment's external validity. These threats can only be addressed by 
replicating and reproducing these studies. Each new run reduces the probability that our results can be 
explained by human variation or experimental error. Consequently, we are creating a laboratory kit (i.e., 
a package containing all the experimental materials, data, and analysis) to facilitate replication. The kit 
should be publicly available by June, 1994. 

• Finally, we are using the lab kit to reproduce the experiments with other university researchers in Japan, 
Germany, Italy, and Australia and with industrial developers at AT&T Bell Laboratories and Motorola 
Inc. These studies will allow us to evaluate our hypotheses with different populations of programmers and 
different software artifacts. 

A Ad Hoc Detection 

The defect taxonomy is due to the work of Schneider, et al., and Basili and Weiss. 

• Omission 

- Missing Functionality: Information describing the desired internal operational behavior of the system 
has been omitted from the SRS. 

- Missing Performance: Information describing the desired performance specifications has either been 
omitted or described in a way that is unacceptable for acceptance testing. 

- Missing Interface: Information describing how the proposed system will interface and communicate 
with objects outside the scope of the system has been omitted from the SRS. 

- Missing Environment: Information describing the required hardware, software, database, or personnel 
environment in which the system will run has been omitted from the SRS. 

• Commission 

- Ambiguous Information: An important term, phrase or sentence essential to the understanding of 
system behavior has either been left undefined or defined in a way that can cause confusion and 
misunderstanding. 

- Inconsistent Information: Two sentences contained in the SRS directly contradict each other or express 
actions that cannot both be correct or cannot both be carried out. 

- Incorrect Fact: Some sentence contained in the SRS asserts a fact that cannot be true under the 
conditions specified in the SRS. 

- Wrong Section: Essential information is misplaced within the SRS. 

B Checklist Method 


• General 

- Are the goals of the system defined? 

- Are the requirements clear and unambiguous? 

- Is a functional overview of the system provided? 

- Is an overview of the operational modes provided? 

- Have the software and hardware environments been specified? 

- If assumptions that affect implementation have been made, are they stated? 

- Have the requirements been stated in terms of inputs, outputs, and processing for each function? 

- Are all functions, devices, constraints traced to requirements and vice versa? 

- Are the required attributes, assumptions and constraints of the system completely listed? 

• Omission 

- Missing Functionality 

* Are the described functions sufficient to meet the system objectives? 

* Are all inputs to a function sufficient to perform the required function? 

* Are undesired events considered and their required responses specified? 

* Are the initial and special states considered (e.g., system initiation, abnormal termination)? 

- Missing Performance 

* Can the system be tested, demonstrated, analyzed, or inspected to show that it satisfies the 
requirements? 

* Have the data type, rate, units, accuracy, resolution, limits, range and critical values 
for all internal data items been specified? 

* Have the accuracy, precision, range, type, rate, units, frequency, and volume of inputs and outputs 
been specified for each function? 

- Missing Interface 

* Are the inputs and outputs for all interfaces sufficient? 

* Are the interface requirements between hardware, software, personnel, and procedures included? 

- Missing Environment 

* Has the functionality of hardware or software interacting with the system been properly specified? 

• Commission 

- Ambiguous Information 

* Are the individual requirements stated so that they are discrete, unambiguous, and testable? 

* Are all mode transitions specified deterministically? 

- Inconsistent Information 

* Are the requirements mutually consistent? 

* Are the functional requirements consistent with the overview? 

* Are the functional requirements consistent with the actual operating environment? 

- Incorrect or Extra Functionality 

* Are all the described functions necessary to meet the system objectives? 

* Are all inputs to a function necessary to perform the required function? 

* Are the inputs and outputs for all interfaces necessary? 

* Are all the outputs produced by a function used by another function or transferred across an 
external interface? 

- Wrong Section 

* Are all the requirements, interfaces, constraints, etc. listed in the appropriate sections? 

C Scenarios 

C.1 Data Type Consistency Scenario 

1. Identify all data objects mentioned in the overview (e.g., hardware component, application variable, abbre- 
viated term or function) 

(a) Are all data objects mentioned in the overview listed in the external interface section? 

2. For each data object appearing in the external interface section determine the following information: 

• Object name: 

• Class: (e.g., input port, output port, application variable, abbreviated term, function) 

• Data type: (e.g., integer, time, boolean, enumeration) 

• Acceptable values: Are there any constraints, ranges, limits for the values of this object? 

• Failure value: Does the object have a special failure value? 

• Units or rates: 

• Initial value: 

(a) Is the object’s specification consistent with its description in the overview? 

(b) If the object represents a physical quantity, are its units properly specified? 

(c) If the object’s value is computed, can that computation generate a non-acceptable value? 

3. For each functional requirement identify all data object references: 

(a) Do all data object references obey formatting conventions? 

(b) Are all data objects referenced in this requirement listed in the input or output sections? 

(c) Can any data object use be inconsistent with the data object’s type, acceptable values, failure value, 
etc.? 

(d) Can any data object definition be inconsistent with the data object’s type, acceptable values, failure 
value, etc.? 
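
Step 2 of this scenario amounts to filling in one record per data object and then checking it. The following Python sketch illustrates the idea under stated assumptions: the `DataObject` record, its field names, and the `check_data_object` helper are hypothetical and are not part of the experimental materials. It covers question 2(b) directly and uses a range check on the initial value as a narrow stand-in for question 2(c).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataObject:
    """One entry from the external interface section (hypothetical layout)."""
    name: str
    obj_class: str                                    # e.g., "input port"
    data_type: str                                    # e.g., "integer", "time"
    acceptable: Optional[Tuple[float, float]] = None  # inclusive (low, high)
    failure_value: Optional[float] = None
    units: Optional[str] = None
    initial_value: Optional[float] = None
    physical: bool = False                            # physical quantity?


def check_data_object(obj: DataObject) -> List[str]:
    """Return potential defects suggested by questions 2(b) and 2(c)."""
    findings = []
    # 2(b): a physical quantity should have its units specified.
    if obj.physical and not obj.units:
        findings.append(f"{obj.name}: physical quantity without units")
    # Stand-in for 2(c): the initial value should lie in the acceptable range.
    if obj.acceptable is not None and obj.initial_value is not None:
        low, high = obj.acceptable
        if not (low <= obj.initial_value <= high):
            findings.append(f"{obj.name}: initial value outside acceptable range")
    return findings
```

A reviewer would still have to read the SRS to fill in each record; the sketch only shows how the recorded attributes turn into mechanical checks.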

C.2 Incorrect Functionality Scenario 

1. For each functional requirement identify all input/output data objects: 

(a) Are all values written to each output data object consistent with its intended function? 

(b) Identify at least one function that uses each output data object. 

2. For each functional requirement identify all specified system events: 

(a) Is the specification of these events consistent with their intended interpretation? 

3. Develop an invariant for each system mode (i.e., under what conditions must the system exit or remain in 
a given mode?). 

(a) Can the system’s initial conditions fail to satisfy the initial mode’s invariant? 

(b) Identify a sequence of events that allows the system to enter a mode without satisfying the mode’s 
invariant. 

(c) Identify a sequence of events that allows the system to enter a mode, but never leave (deadlock). 
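
Question 3(c) can be approximated mechanically once the mode transitions have been extracted from the SRS. The Python sketch below assumes a hypothetical representation in which each mode maps events to successor modes; it reports modes that are reachable from the initial mode but have no transition to any other mode. Whether such a mode is a defect or an intended terminal mode remains the reviewer's judgment.

```python
from collections import deque
from typing import Dict, Set

# mode -> {event -> next mode}; an assumed, simplified transition table.
Transitions = Dict[str, Dict[str, str]]


def reachable_modes(transitions: Transitions, initial: str) -> Set[str]:
    """Breadth-first search over the mode graph starting at the initial mode."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        mode = queue.popleft()
        for nxt in transitions.get(mode, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def trap_modes(transitions: Transitions, initial: str) -> Set[str]:
    """Reachable modes that can be entered but never left."""
    traps = set()
    for mode in reachable_modes(transitions, initial):
        exits = {nxt for nxt in transitions.get(mode, {}).values() if nxt != mode}
        if not exits:
            traps.add(mode)
    return traps
```

For example, with the hypothetical table `{"Init": {"start": "Run"}, "Run": {"fault": "Halt"}, "Halt": {}}`, the mode `Halt` is reported because no event leads out of it.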

C.3 Ambiguities Or Missing Functionality Scenario 

1. Identify the required precision, response time, etc. for each functional requirement. 

(a) Are all required precisions indicated? 

2. For each requirement, identify all monitored events. 

(a) Does a sequence of events exist for which multiple output values can be computed? 

(b) Does a sequence of events exist for which no output value will be computed? 

3. For each system mode, identify all monitored events. 

(a) Does a sequence of events exist for which transitions into two or more system modes are allowed? 
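
The simplest case of question 3(a) is a single monitored event that leads from one mode into two or more different modes. The Python sketch below checks only that single-event case, not longer event sequences, and assumes the transitions have been written down as hypothetical `(mode, event, next_mode)` triples.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Transition = Tuple[str, str, str]  # (mode, event, next mode); assumed format


def ambiguous_transitions(
    table: Iterable[Transition],
) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (mode, event) pair to its targets when more than one exists."""
    targets: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for mode, event, nxt in table:
        targets[(mode, event)].add(nxt)
    return {key: dests for key, dests in targets.items() if len(dests) > 1}
```

An empty result does not show that the specification is unambiguous; it only shows that no single event is specified with conflicting target modes.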

Software Process Evolution at the SEL 

Victor Basili, University of Maryland 

Scott Green, NASA Goddard Space Flight Center 


The Software Engineering Laboratory has been adapting, analyzing, and evolving software processes 
for the last 18 years. Their approach is based on the Quality Improvement Paradigm, which is used to 
evaluate process effects on both product and people. The authors explain this approach as it was 
applied to reduce defects in code. 


Since 1976, the Software Engineering Laboratory of the National Aeronautics and Space 
Administration’s Goddard Space Flight Center has been engaged in a program of understanding, 
assessing, and packaging software experience. Topics of study include process, product, resource, 
and defect models, as well as specific technologies and tools. The approach of the SEL (a consortium 
of the Software Engineering Branch of NASA Goddard’s Flight Dynamics Division, the Computer 
Science Department of the University of Maryland, and the Software Engineering Operation of 
Computer Sciences Corp.) has been to gain an in-depth understanding of project and environment 
characteristics using process models and baselines. A process is evaluated for study, applied 
experimentally to a project, analyzed with respect to baselines and process model, and evaluated in 
terms of the experiment’s goals. Then on the basis of the experiment’s conclusions, results are 
packaged and the process is tailored for improvement, applied again, and reevaluated. 

In this article, we describe our improvement approach, the Quality Improvement Paradigm, as the SEL 
applied it to reduce code defects by emphasizing reading techniques. The box on p. 63 describes the 
Quality Improvement Paradigm in detail. In examining and adapting reading techniques, we go through 
a systematic process of evaluating the candidate process and refining its implementation through 
lessons learned from previous experiments and studies. 

As a result of this continuous, evolutionary process, we determined that we could successfully apply 
key elements of the Cleanroom development method in the SEL environment, especially for projects 
involving fewer than 50,000 lines of code (all references to lines of code refer to developed, not 
delivered, lines of code). We saw indications of lower error rates, higher productivity, a more 
complete and consistent set of code comments, and a redistribution of developer effort. Although we 
have not seen similar reliability and cost gains for larger efforts, we continue to investigate the 
Cleanroom method’s effect on them. 


EVALUATING CANDIDATE PROCESSES 

To enhance the possibility of improvement in a particular environment, the SEL introduces and 
evaluates new technology within that environment. This involves experimentation with the new 
technology, recording findings in the context of lessons learned, and adjusting the associated 
processes on the basis of this experience. When the technology is notably risky (substantially 
different from what is familiar to the environment) or requires more detailed evaluation than would 
normally be expended, the SEL conducts experimentation off-line from the project environment. 

Off-line experiments may take the form of either controlled experiments or case studies. Controlled 
experiments are warranted when the SEL needs a detailed analysis with statistical assurance in the 
results. One problem with controlled experiments is that the project must be small enough to 
replicate the experiment several times. The SEL then performs a case study to validate the results on 
a project of credible size that is representative of the environment. The case study adds validity 
and credibility through the use of typical development systems and professional staff. In analyzing 
both controlled experiments and case studies, the Goal/Question/Metric paradigm, described in the box 
on p. 63, provides an important framework for focusing the analysis. 

On the basis of experimental results, the SEL packages a set of lessons learned and makes them 
available in an experience base for future analysis and application of the technology. 

Experiment 1: Reading versus testing. 

Although the SEL had historically been a test-driven organization, we decided to experiment with 
introducing reading techniques. We were particularly interested in how reading would compare with 
testing for fault detection. The goals of the first off-line, controlled experiment1 were to analyze 
and compare code reading, functional testing, and structural testing, and to evaluate them with 
respect to fault-detection effectiveness, cost, and classes of faults detected. 

We needed an analysis from the viewpoint of quality assurance as well as a comparison of performance 
with respect to software type and programmer experience. Using the GQM paradigm, we generated 
specific questions on the basis of these goals. 
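
The goal stated above can be written down using the goal elements that the box on the Quality Improvement Paradigm lists (object, aspect, purpose, point of view, and context). The Python sketch below is only an illustration: the `GQMGoal` and `Question` classes are hypothetical, and the two questions and their metrics are invented examples rather than the questions actually generated for the experiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Question:
    text: str
    metrics: List[str] = field(default_factory=list)


@dataclass
class GQMGoal:
    obj: str        # object of interest
    aspect: str     # aspect of interest
    purpose: str    # purpose of the study
    viewpoint: str  # point of view
    context: str    # context of the study
    questions: List[Question] = field(default_factory=list)

    def metrics(self) -> List[str]:
        """Distinct metrics needed to answer the questions, in first-use order."""
        seen: List[str] = []
        for question in self.questions:
            for metric in question.metrics:
                if metric not in seen:
                    seen.append(metric)
        return seen


goal = GQMGoal(
    obj="code reading, functional testing, and structural testing",
    aspect="fault-detection effectiveness, cost, and classes of faults detected",
    purpose="compare",
    viewpoint="quality assurance",
    context="software type and programmer experience",
    questions=[
        # Illustrative questions only; not taken from the experiment.
        Question("How many faults does each technique detect?",
                 ["faults detected"]),
        Question("How many faults does each technique detect per hour?",
                 ["faults detected", "hours of use"]),
    ],
)
```

Calling `goal.metrics()` on this example yields the list of data items that would have to be collected, which is the direction of refinement (goal, then questions, then metrics) that the paradigm prescribes.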

We had subjects use reading by stepwise abstraction,2 equivalence-partitioning boundary-value 
testing, and statement-coverage structural testing. 

We conducted the experiment twice at the University of Maryland on graduate students (42 subjects) 
and once at NASA Goddard (32 subjects). The experiment structure was a fractional factorial design, 
in which every subject applied each technique on a different program. The programs included a text 
formatter, a plotter, an abstract data type, and a database, and they ranged from 145 to 365 lines of 
code. We seeded each program with faults. The reading performed was at the unit level. 
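
The property stated above, that every subject applies each technique to a different program, can be illustrated with a simple rotation (a Latin-square-style assignment). The Python sketch below is an assumption made for illustration and is not a reconstruction of the actual fractional factorial design; the function name `rotation_design` and the use of three programs per run are hypothetical.

```python
from typing import Dict, List


def rotation_design(techniques: List[str],
                    programs: List[str]) -> List[Dict[str, str]]:
    """Return one technique-to-program assignment per subject group.

    Group k pairs technique i with program (i + k) mod n, so each group
    uses every technique once, each on a different program.
    """
    if len(techniques) != len(programs):
        raise ValueError("need exactly one program per technique")
    n = len(techniques)
    return [
        {techniques[i]: programs[(i + shift) % n] for i in range(n)}
        for shift in range(n)
    ]
```

Across the groups, each technique is also paired with every program once, which is what lets technique effects be separated from program effects in a design of this kind.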


Although the results from both 
experiments support the emphasis on 
reading techniques, we report only the 
results of the controlled experiment on 
the NASA Goddard subjects because it 
involved professional developers in the 
target environment. 

Figure 1 shows the fault-detection effectiveness and rate for each approach for the NASA Goddard 
experiment. Reading by stepwise abstraction proved superior to testing techniques in both the 
effectiveness and cost of fault detection, while obviously using fewer computer resources. 

Figure 1. Results of the reading-versus-testing controlled experiment, in which reading was compared 
with functional and structural testing. (A) Mean number of faults detected for each technique and (B) 
number of faults detected per hour of use for each technique. 
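
The two measures named in the Figure 1 caption can be computed from per-subject session records. The Python sketch below assumes a hypothetical record format of (faults detected, hours spent) per session and computes the rate as total faults over total hours; the article does not say whether the rate was aggregated this way or averaged per subject, so that choice is an assumption.

```python
from statistics import mean
from typing import Dict, List, Tuple

Session = Tuple[int, float]  # (faults detected, hours spent); assumed format


def summarize(sessions: Dict[str, List[Session]]) -> Dict[str, Dict[str, float]]:
    """Per technique: mean faults detected and faults detected per hour."""
    summary = {}
    for technique, runs in sessions.items():
        faults = [f for f, _ in runs]
        hours = [h for _, h in runs]
        summary[technique] = {
            "mean_faults": mean(faults),
            "faults_per_hour": sum(faults) / sum(hours),
        }
    return summary
```

Any input data used with this sketch would have to come from the experiment's own records; no values from Figure 1 are reproduced here.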

Even more interesting was that the 
subjects did a better job of estimating 
the code quality using reading than 
they did using testing. Readers 
thought they had found only about 
half the faults (which was nominally 
correct), while functional testers felt 
they had found essentially all the faults 
(which was never correct). 

Furthermore, after completing the experiment, more than 90 percent of the participants thought 
functional testing had been the most effective technique, although the results clearly showed 
otherwise. This gave us some insight into the psychological effects of reading versus testing. 
Perhaps one reason testing appeared more satisfying was that the successful execution of multiple 
test cases generated a greater comfort level with the product quality, actually providing the tester 
with a false sense of confidence. 

Reading was also more effective in uncovering most classes of faults, including interface faults. 
This told us that perhaps reading might scale up well on larger projects. 

Experiment 2: Validation with Cleanroom. 

On the basis of these results, we decided to emphasize reading techniques in the SEL environment. 
However, we saw little improvement in overall reliability of the development systems. Part of the 
reason may have been that SEL project personnel had developed such faith in testing that the quality 
of their reading was relaxed, with the assumption that testing would ultimately uncover the same 
faults. We conducted a small off-line experiment at the University of Maryland to test this 
hypothesis; the results supported our assumption. (We did this on a small scale just to verify our 
hypothesis before continuing with the Cleanroom experiment.) 

Why the Cleanroom method? The Cleanroom method emphasizes human discipline in the development 
process, using a mathematically based design approach and a statistical testing approach based on 
anticipated operational use.3 Development and testing teams are independent, and all 
development-team activities are performed without on-line testing. 

Techniques associated with the method are the use of box structures and state machines, reading by 
stepwise abstraction, formal correctness demonstrations, and peer review. System development is 
performed through a pipeline of small increments to enhance concentration and permit testing and 
development to occur in parallel. 

Because the Cleanroom method 
removes developer testing and relies 
on human discipline, we felt it would 
overcome the psychological barrier of 
reliance on testing. 

Applying the QIP. The first step of the Quality Improvement Paradigm is to characterize the project 
and its environment. The removal of developer unit testing made the Cleanroom method a high-risk 
technology. Again, we used off-line experimentation at the University of Maryland as a mitigating 
approach.4 The environment was a laboratory course at the university, and the project involved an 
electronic message system of about 1,500 LOC. The experiment structure was a simple replicated 
design, in which control and experiment teams are defined. We assigned 10 three-person experiment 
teams to use the Cleanroom method. We gave five three-person control teams the same development 
methodology, but allowed them to test their systems. Each team was allowed five independent test 
submissions of their programs. We collected data on programmer background and attitude, 
computer-resource activity, and actual testing results. 

The second step in the Quality Improvement Paradigm is to set goals. The goal here was to analyze the 
effects of the Cleanroom approach and evaluate it with respect to process, product, and participants, 
as compared with the non-Cleanroom approach. We generated questions corresponding to this goal, 
focusing on the method’s effect on each aspect being studied. 

Figure 2. Sample measures, baselines, and expectations for the case studies investigating the 
Cleanroom method. 

The next step of the Quality Improvement Paradigm involves selecting an appropriate process model. 
The process model selected for this experiment was the Cleanroom approach as defined by Harlan Mills 
at IBM’s Federal Systems Division, but modified for our environment. For example, the 
graduate-student assistant for the course served as each group’s independent test team. Also, because 
we used a language unfamiliar to the subjects to prevent bias, there was a risk of errors due solely 
to ignorance about the language. We therefore allowed teams to cleanly compile their code before 
submitting it to the tester. 

Because of the nature of controlled 
experimentation, we made few modifi- 
cations during the experiment. 

Cleanroom’s effect on the software-development process resulted in the Cleanroom developers more 
effectively applying the off-line reading techniques; the non-Cleanroom teams focused their efforts 
more on functional testing than reading. The Cleanroom teams spent less time on-line and were more 
successful in making scheduled deliveries. Further analysis revealed that the Cleanroom products had 
less dense complexity, a higher percentage of assignment statements, more global data, and more code 
comments. These products also more completely met the system requirements and had a higher 
percentage of successful independent test cases. 

The Cleanroom developers indicat- 
ed that they modified their normal 
software-development activities by 
doing a more effective job of reading, 
though they missed the satisfaction of 
actual program execution. Almost all 
said they would be willing to use 
Cleanroom on another development 
assignment. 

Through observation, it was also clear that the Cleanroom developers did not apply the formal methods 
associated with Cleanroom very rigorously. Furthermore, we did not have enough failure data or 
experience with Cleanroom testing to apply a reliability model. However, general analysis did 
indicate that the Cleanroom approach had potential payoff, and that additional investigation was 
warranted. 

You can also view this experiment from the following perspective: We applied two development 
approaches. The only real difference between them was that the control teams had one extra piece of 
technology (developer testing), yet they did not perform as well as the experiment teams. One 
explanation might be that the control group did not use the available nontesting techniques as 
effectively because they knew they could rely on testing to detect faults. This supports our earlier 
findings associated with the reading-versus-testing experiment. 

EVOLVING SELECTED PROCESS 

The positive results gathered from these two experiments gave us the justification we needed to 
explore the Cleanroom method in case studies, using typical development systems as data points. We 
conducted two case studies to examine the method, again following the steps of the Quality 
Improvement Paradigm. A third case study was also recently begun. 

First case study. The project we selected, Project 1, involved two subsystems from a typical attitude 
ground-support system. The system performs ground processing to determine a spacecraft’s attitude, 
receiving and processing spacecraft telemetry data to meet the requirements of a particular mission. 


The subsystems we chose are an integral part of attitude determination and are highly algorithmic. 
Both are interactive programs that together contain approximately 40,000 LOC, representing about 12 
percent of the entire attitude ground-support system. The rest of the ground-support system was 
developed using the standard SEL development methodology. 

The project was staffed principally by five people from the Flight Dynamics Division, which houses 
the SEL. All five were also working on other projects, so only part of their time was allocated to 
the two subsystems. Their other responsibilities often took time and attention away from the case 
study, but this partial allocation represents typical staffing in this environment. All other 
projects with which the Project 1 staff were involved were non-Cleanroom efforts, so staff members 
would often be required to use multiple development methodologies during the same workday. 

The primary goal of the first case study was to increase software quality and reliability without 
increasing cost. We also wanted to compare the characteristics of the Cleanroom method with those 
typical of the FDD environment. A well-calibrated baseline was available for comparison that 
described a variety of process characteristics, including effort distribution, change rates, error 
rates, and productivity. The baseline represents the history of many earlier SEL studies. Figure 2 
shows a sample of the expected variations from the SEL baselines for a set of process 
characteristics. 

Choosing and tailoring processes. The process models available for examination were the standard SEL 
model,5 which represents a reuse-oriented waterfall life-cycle model; the IBM/FSD Cleanroom model, 
which appeared in the literature and was available through training; and the experimental University 
of Maryland Cleanroom model, which was used in the earlier controlled experiment.4 

We examined the lessons learned from applying the IBM and University of Maryland models. The results 
from the IBM model were notably positive, showing that the basic process, methods, and techniques 
were effective for that particular environment. However, the process model had been applied by the 
actual developers of the methodology, in the environment for which it was developed. The University 
of Maryland model also had specific lessons, including the effects of not allowing developers to test 
their code, the effectiveness of the process on a small project, and the conclusion that formal 
methods appeared particularly difficult to apply and required specific skills. 

On the basis of these lessons and the characteristics of our environment, we selected a Cleanroom 
process model with four key elements: 

♦ separation of devel- 
opment and test teams, 

♦ reliance on peer 
review instead of unit- 
level testing as the pri- 
mary developer verifica- 
tion technique, 

♦ use of informal 
state machines and 
functions to define the 
system design, and 

♦ a statistical approach to testing 
based on operational scenarios. 
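
The fourth element, statistical testing based on operational scenarios, is commonly understood as drawing test cases at random in proportion to how often each scenario is expected in operation. The Python sketch below illustrates that idea only; the function name, the profile format, and any scenario names or weights supplied to it are hypothetical and do not come from the project.

```python
import random
from typing import Dict, List


def draw_test_scenarios(profile: Dict[str, float], n: int,
                        seed: int = 0) -> List[str]:
    """Sample n scenario names with probability proportional to their weight.

    `profile` maps a scenario name to its expected relative frequency of use.
    A fixed seed keeps the selection repeatable.
    """
    if not profile:
        raise ValueError("operational profile is empty")
    if any(weight < 0 for weight in profile.values()):
        raise ValueError("weights must be non-negative")
    if sum(profile.values()) <= 0:
        raise ValueError("at least one weight must be positive")
    rng = random.Random(seed)
    names = list(profile)
    weights = [profile[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

Because frequently used scenarios are tested more often, failures observed under such a scheme give a rough picture of the reliability a user would experience, which is the rationale behind tying testing to anticipated operational use.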

We also provided training for the subjects, consistent with a University of Maryland course on the 
Cleanroom process model, methods, and techniques, with emphasis on reading through stepwise 
abstraction. We also stressed code reading by multiple reviewers because stepwise abstraction was new 
to many subjects. Michael Dyer and Terry Baker of IBM/FSD provided additional training and motivation 
by describing IBM’s use of Cleanroom. 

To mitigate risk and address the developers’ concerns, we examined backout options for the 
experiment. For example, because the subsystems were highly mathematical, we were afraid it would be 
difficult to find and correct mathematical errors without any developer testing. Because the project 
was part of an operational system with mission deadlines, we discussed options that ranged from 
allowing developer unit testing to discontinuing Cleanroom altogether. These discussions helped allay 
the primary apprehension of NASA Goddard management in using the new methodology. When we could not 
get information about process application, we followed standard SEL process-model activities. 

We also noted other management and project-team concerns. Requirements and specifications change 
frequently during the development cycle in the FDD environment. This instability was of particular 
concern because the Cleanroom method is built on the precept of developing software right the first 
time. Another concern was that, given the difficulties encountered in the University of Maryland 
experiment about applying formal methods, how successfully could a classical Cleanroom approach be 
applied? Finally, there was concern about the psychological effects of separating development and 
testing, specifically the inability of the developers to execute their code. We targeted all these 
concerns for our postproject analysis. 

Project 1 lasted from January 1988 through September 1990. We separated the five team members into a 
three-person development team and a two-person test team. The development team broke the total effort 
into six incremental builds of approximately 6,500 LOC each. An experimenter team consisting of NASA 
Goddard managers, SEL representatives, a technology advocate familiar with the IBM model, and the 
project leader monitored the overall process. 

PROJECT RESULTS LED US TO EMPHASIZE PEER REVIEWS AND USE OF INDEPENDENT TESTING. 

We modified the process in real time, as needed. For example, when we merged Cleanroom products into 
the standard FDD formal review and documentation activities, we had to modify both. We altered the 
design process to combine the use of state machines and traditional structured design. We also 
collected data for the monitoring team at various points throughout the project, although we tried to 
do this with as little disturbance as possible to the project team. 

Analyzing and packaging results. The final steps in the QIP involve analyzing and packaging the 
process results. We found significant differences in effort distribution during development between 
the Cleanroom project and the baseline. Approximately six percent of the total project effort shifted 
from coding to design activities in the Cleanroom effort. Also, the baseline development teams 
traditionally spent approximately 85 percent of their coding effort writing code, 15 percent reading 
it. The Cleanroom team spent about 50 percent in each activity. 

The primary goal of the first case 
study had been to improve reliability 
without increasing cost. Analysis 
showed a reduction in change rate of 
nearly 50 percent and a reduction in 
error rate of greater than a third. 
Although die expectation was for pro- 
ductivity equh'alent to the baseline, the 
Cleanroom effon also improved in that 
area by approximately 50 percent. Mfe 
also saw a decrease in rew'ork, as 
defined by the amount of time spent 
correcting errors. Additional analysis of 
code reading revealed that three 
fourths of all errors uncovered were 
found by only one reader. This 
prompted a renew^ed emphasis on mul- 
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QUALITY IMPROVEMENT PARADIGM: FOUNDATION FOR IMPROVEMENT 


I The Quality Iinprove- 
ment Paradigm is an effec- 
tive framework for conduct- 
ing experiments and studies 
like those described in the 
main texL It is an expeh- 
I mental but evoludonaiy 
I concept for learning and 
j improvemenL^ 

I The QIP has six steps: 
i 1. Characterize the pro- 
ject and its environment. - 

2. Set qu^dfiable goal^. 
for succes^ project perfor- 
mance and improvement. 

3 . Choose the appropri- 
ate process model^ support- 
ing mediods, and tools for 
dieprojjpct. ' 

. 4. Execute processes, 

^ construct £e produce col- 
lect and validate the pre- 
scribed data, and analyze the 
. data to provide real-time 
feedbag fr>r corrective 
action. 

5. Analyze the data to 
evaluace current practices, . 
determine problems, record 
findings, and make recom- 
mendations icr future 
process.improvements. . 

\ 6. Pack^ the dqpen^^ 
in ^ fbnn (^updated and re^ ; 
finixl models, and $ 2 ^ the •- 
knoiide<%e gained firom diis v 
and eai^ projects in an eipe- j ' 
lienoe base ^ future 


The QIP uses two tools: 
the Goal/Question/Metric 
paradigm and the 
Experience Factory 
Organization. 

GQII porodigB. The GQM 

.paradigm is a mechanism 
used in the p lanning phase 
of the Quality Improvement 
Paradigm for defining and 
' evaluating a set of opera- 
tional goals using measure- 
ment.2 It provides a system- 
atic approach for tailoring 
and int^;rating goals widi 
models of the sc^cware 
processes, products, and 
quali^ perspectives of inter- 
est, according to the ^>ecific 
needs of the project and 
organizatiotL 

You ^fine goals in an 
<^)crat 2 QDaI, tractable way by 
refining them into a set 
quesQons that extract appro- 
priate infonnaticm fixun the 
models. The questions, in 
turn, define the mectks . 
needed to define and inter- 
pret the goals. 

Agoal-genemon tsn^ 
opiate be^ in de^oping 


the essennal eleinents: the 
ohj^ of interest (like prod- 
. vct or prooessX Ae affect of 
i nt erest (liire cost or ^Oity 


to detect defects), the pur- 
pose of the study (like assess- 
ment or prediction), the 
point of view from which 
the study is performed (like 
customer’s or manager’s), 
and die context in which the 
study is performed (like peo- 
ple-oriented or problem-ori- 
ented factors). 

For example, two goals 
assocxaied with die a^^Uca- 
don of the Qeanroom 
method in the SEL were 
analysis of (he Qeanroom 
process to characterize 
resource allocation from the 
project manager's point of 
view, and analysis of the 
Qeanroom product to char- 
acterize dcfbcts from the 
customer’s point of view. 

Experience Factory Organization. The Experience Factory
Organization is an organizational structure that supports
the activities specified in the QIP by continuously
accumulating evaluated experiences, building a repository
of integrated experience models that projects can access
and modify to meet their needs. The Experience Factory
extends project-development activities by providing
systematic learning and packaging of reusable
experiences. It packages experiences by building
informal, schematized, formal, and automated models and
measures of software processes, products, and other forms
of knowledge, and distributes them through consultation,
documentation, and automated support.

While the project organization follows an evolutionary
process model that reuses packaged experiences, the
Experience Factory provides the set of processes needed
for learning, packaging, and storing the project
organization’s experience for reuse. The Experience
Factory Organization represents the integration of these
two functions.

multiple readers throughout the SEL environment.

We also examined the earlier con- 
cerns expressed by managers and the 
project team. The results showed 
increased effort in early requirements- 
analysis and design activities and a 
clearer set of in-line comments. This 
led to a better understanding of the 
whole system and enabled the project 
team to understand and accommodate 
changes with greater ease than was 
typical for that environment.

We reviewed the application of classical Cleanroom and noted
successes and difficulties. The structure of independent teams and
the emphasis on peer review during development were easy to apply.
However, the development team did have difficulty using the
associated formal methods. Also, unlike the scheme in the classical
Cleanroom method, the test team followed an approach that combined
statistical testing with traditional functional testing.

Finally, the psychological effects of independent testing appeared
to be negligible. All team members indicated high job satisfaction
as well as a willingness to apply the method in future projects.

We packaged these early results in various reports and
presentations, including some at the SEL’s 1990 Software
Engineering Workshop. As a reference for future SEL Cleanroom
projects, we also began efforts to produce a document describing
the SEL Cleanroom process model, including details on specific
activities. (The completed document is now available to current
Cleanroom projects.)

Second case study. The first case study showed us that we needed
better training in the use of formal methods and more guidance in
applying the testing approach. We also realized that experiences
from the initial project team had to be disseminated and used.

Again, we followed the Quality Improvement Paradigm. We selected
two projects: one similar to the initial Cleanroom project,
Project 2A, and one more representative of the typical FDD
contractor-support environment, Project 2B.

Figure 3. Measurement comparisons for two case studies
investigating Cleanroom. The first case study involved one project,
Project 1. The second case study involved two projects, Projects 2A
and 2B. (A) Percentage of total development effort for various
development activities, and (B) productivity in lines of code per
day, change rate in changes per thousand lines of code, and
reliability in errors per thousand lines of code.

Project 2A involved a different subsystem of another attitude
ground-support system. This subsystem focused on the processing of
telemetry data, comprising 22,000 LOC. The project was staffed with
four developers and two testers. Project 2B involved an entire
mission attitude ground-support system, consisting of approximately
160,000 LOC. At its peak, it was staffed with 14 developers and
four testers.

Setting goals and choosing processes. The second case study had two
goals. One was to verify measures from the first study by applying
the Cleanroom method to Project 2A, a project of similar size and
scope. The second was to verify the applicability of Cleanroom on
Project 2B, a substantially larger project but one more
representative of the typical environment. We also wanted to
further tailor the process model to the environment by using
results from the first case study and applying more formal
techniques.

Packages from the SEL Experience Factory (described in the box on
p. 63) were available to support project development. These
included an evolved training program, a more knowledgeable
experimenter team to monitor the projects, and several in-process
interactive sessions with the project teams. Although we had begun
producing a handbook detailing the SEL Cleanroom process model, it
was not ready in time to give to the teams at the start of these
projects.

The project leader for the initial 
Cleanroom project participated as a 
member of the experimenter team, 
served as the process modeler for the 
handbook, and acted as a consultant to
the current projects. 

We modified the process according to the experiences of the
Cleanroom team in the first study. Project 1’s team had had
difficulty using state machines in system design, so we changed the
emphasis to Mills’ box-structure algorithm. We also added a more
extensive training program focusing on Cleanroom techniques,
experiences from the initial Cleanroom team, and the relationship
between the Cleanroom studies and the SEL’s general goals. The
instruction team included representatives from the SEL, members of
the initial team, and Mills. Mills gave talks on various aspects of
the methodology, as well as motivational remarks on the potential
benefits of the Cleanroom method in the software community.

TABLE 1

PROJECT COMPARISONS FOR SEL TECHNOLOGY EVALUATION

Team size
   Controlled experiment, reading vs. testing:
      32 participants
   Controlled experiment, Cleanroom:
      Three-person development teams (10 experiment teams, five
      control teams); common independent tester
   Cleanroom case study, Project 1:
      Three-person development team; two-person test team
   Cleanroom case study, Project 2A:
      Four-person development team; two-person test team
   Cleanroom case study, Project 2B:
      Fourteen-person development team; four-person test team

Project size and application
   Controlled experiment, reading vs. testing:
      Small (145-565 LOC) sample Fortran programs
   Controlled experiment, Cleanroom:
      1,500 LOC, Fortran, electronic message system for graduate
      laboratory course
   Cleanroom case study, Project 1:
      40,000 LOC, Fortran, flight-dynamics ground-support system
   Cleanroom case study, Project 2A:
      22,000 LOC, Fortran, flight-dynamics ground-support system
   Cleanroom case study, Project 2B:
      160,000 LOC, Fortran, flight-dynamics ground-support system

Results
   Controlled experiment, reading vs. testing:
      Reading techniques appear more effective than testing
      techniques for fault detection
   Controlled experiment, Cleanroom:
      Cleanroom teams use fewer computer resources, satisfy
      requirements more successfully, and make higher percentage
      of scheduled deliveries
   Cleanroom case study, Project 1:
      Project spends higher percentage of effort in design, uses
      fewer computer resources, and achieves better productivity
      and reliability than environment baseline
   Cleanroom case study, Project 2A:
      Project continues trend in better reliability while
      maintaining baseline productivity
   Cleanroom case study, Project 2B:
      Project reliability only slightly better than baseline while
      productivity falls below baseline

Project 2A ran from March 1990 through January 1992. Project 2B
ran from February 1990 through December 1992. Again, we examined
reliability, productivity, and process characteristics, comparing
them to Project 1 results and the SEL baseline.

Analyzing and packaging results. As Figure 3 shows, there were
significant differences between the two projects. Error and change
rates for Project 2A continued to be favorable. Productivity rate,
however, returned to the SEL baseline value. Error and change rates
for Project 2B increased from Project 1 values, although they
remained lower than SEL baseline numbers. Productivity, however,
dropped below the baseline.

When we examined the effort distribution among the baseline and
Projects 1, 2A, and 2B, we found a continuing upward trend in the
percentage of design effort, and a corresponding decrease in coding
effort. Additional analysis indicated that although the overall
error rates were below the baseline, the percentage of system
components found to contain errors during testing was still
representative of baseline projects developed in this environment.
This suggests that the breadth of error distribution did not change
with the Cleanroom method.

In addition to evaluating objective data for these two projects, we
gathered subjective input through written and verbal feedback from
project participants. In general, input from Project 2A team
members, the smaller of the two projects, was very favorable, while
Project 2B members, the larger contractor team, had significant
reservations about the method’s application. Interestingly, though,
specific shortcomings were remarkably similar for both teams. Four
areas were generally cited in the comments. Participants were
dissatisfied with the use of design abstractions and box
structures, did not fully accept the rationale for having no
developer compilation, had problems coordinating information
between developers and testers, and cited the need for a reference
to the SEL Cleanroom process model.

Again, we packaged these results into various reports and
presentations, which formed the basis for additional process
tailoring.

Third case study. We have recently begun a third case study to
examine difficulties in scaling up the Cleanroom method in the
typical contractor-support environment and to verify previous
trends and analyze additional tailoring of the SEL process model.
We expect the study to complete in September.

In keeping with this goal, we again selected a project
representative of the FDD contractor-support environment, but one
that was estimated at 110,000 LOC, somewhat smaller than Project
2B. The project involves development of another entire mission
attitude ground-support system. Several team members have prior
experience with the Cleanroom method through previous SEL studies.

Experience Factory packages available to this project include
training in the Cleanroom method, an experienced experimenter team,
and the SEL Cleanroom Process Model (the completed handbook). In
addition to modifying the process model according to the results
from the first two case studies, we are providing regularly
scheduled sessions in which the team members and experimenters can
interact. These sessions give team members the opportunity to
communicate problems they are having in applying the method, ask
for clarification, and get feedback on their activities. This
activity is aimed at closing a communication gap that the
contractor team felt existed in Project 2B.

The concepts associated with the QIP and its use of measurement
have given us an evolutionary framework for understanding,
assessing, and packaging the SEL’s experiences.

Table 1 shows how the evolution of our Cleanroom study progressed
as we used measurements from each experiment and case study to
define the next experiment or study. The SEL Cleanroom process
model has evolved on the basis of results packaged through earlier
evaluations. Some aspects of the target methodology continue to
evolve: Experimentation with formal methods has transitioned from
functional decomposition and state machines to box-structure design
and again to box-structure application as a way to abstract
requirements. Testing has shifted from a combined
statistical/functional approach to a purely statistical approach
based on operational scenarios. Our current case study is examining
the effect of allowing developer compilation.

Along the way, we have eliminated some aspects of the candidate
process; we have not examined reliability models, for example,
since the environment does not currently have sufficient data to
seed them. We have also emphasized some aspects. For example, we
are conducting studies that focus on the effect of peer reviews and
independent test teams for non-Cleanroom projects. We are also
studying how to improve reading by developing reading techniques
through off-line experimentation.

The SEL baseline used for comparison is undergoing continual
evolution. Promising techniques are filtered into the development
organization as general process improvements, and corresponding
measures of the modified process (effort distribution, reliability,
cost) indicate the effect on the baseline.

The SEL Cleanroom process model has evolved to a point where it
appears applicable to smaller projects (fewer than 50,000 LOC), but
additional understanding and tailoring is still required for larger
scale efforts. The model will continue to evolve as we gain more
data from development projects. Measurement will provide baselines
for comparison, identify areas of concern and improvement, and
provide insight into the effects of process modifications. In this
way, we can set quantitative expectations and evaluate the degree
to which goals have been achieved.

By adhering to the Quality Improvement Paradigm, we can refine the
process model from study to study, assessing strengths and
weaknesses, experiences, and goals. However, our investigation into
the Cleanroom method illustrates that the evolutionary infusion of
technology is not trivial and that process improvement depends on a
structured approach of understanding, assessment, and packaging.
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Abstract 

As shown by the work of Bertrand Meyer, it is pos- 
sible to simulate genericity using inheritance, but not 
vice-versa. This is because genericity is a parameter- 
ization mechanism with no way to deal with the 
polymorphic typing introduced using inheritance. 
Nevertheless, if we focus on the use of inheritance 
as an implementation technique, its key feature is 
the dynamic binding of self-referential operation 
calls. This turns out to be basically a parameteriza- 
tion mechanism that can in fact be simulated using 
generics and static binding. And for some applica- 
tions this approach may actually be of more than 
academic interest. 

Introduction 

In his classic paper on “Genericity versus Inheritance”, Bertrand
Meyer concludes that inheritance cannot be simulated using
genericity because genericity provides no mechanism for achieving
the polymorphism of inheritance [Meyer 86]. This is, of course,
true, since genericity is a parameterization mechanism, not a
typing mechanism. However, as an implementation technique, rather
than as a typing mechanism, the polymorphism of inheritance is
primarily used to achieve the dynamic binding of self-referential
calls to object operations (e.g., messages to self in Smalltalk).

This is not a minor point. Wegner and Zdonik state that “In a world
without self-reference, inheritance reduces to invocation and
inheritance hierarchies are simply tree-structured resource sharing
hierarchies. However, recursive definitions are just as fundamental
for objects as for functions and procedures.” [Wegner 88]. In
effect, inheritance is not inheritance without self-reference. In
this paper I will show that this crucial self-reference property of
inheritance can, in fact, be simulated using genericity.

Cook and Palsberg define a denotational semantics of
self-referential inheritance equivalent to the traditional
operational semantics using dynamic binding [Cook 89]. They use a
“wrapper” function to parameterize the super- and self-references
of a class. These parameters are then “statically bound” using a
fixed-point operation. Thus, self-reference becomes basically a
parameterization problem, which can be handled quite well by
generics.

The following three sections show in detail how 
this is done. The first section reviews the general 
issues of self-reference in the traditional inheritance 
mechanism. The next section shows how generics 
can be used to parameterize this self-reference. 
Finally, the third section extends this approach to 
also parameterize superclass reference. 

The examples in this paper are written in Ada 
9X, the proposed revision to the Ada language 
[Ada9X 94a] (likely to be approved in 1994). Ada
9X has powerful features for both genericity and ob- 
ject-oriented inheritance and is therefore an excel- 
lent real-world vehicle for the discussion here. I will 
introduce and describe the Ada 9X mechanisms for 
inheritance and genericity as necessary in the fol- 
lowing. This should be sufficient for a self-con- 
tained reading of this paper, but it is by no means a 
complete overview of Ada 9X, or even its object-ori- 
ented features. For fuller discussions of Ada 9X, I 
refer the reader to the references [Ada9X 94a], 
[Ada9X 94b] and [Taft 93]. 
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Inheritance 

Hauck uses an instructive example to discuss the issues involved in
inheritance and self-reference [Hauck 93]. This example is based on
a class of objects that service hardware ports. One can output
characters and lines to such ports, with the output of lines
defined in terms of the output of characters. We define this class
in Ada 9X using the following package specification:

   package Port is

      type Object is tagged private;

      procedure Put(O: in out Object; C: in Character);
      procedure Put_Line(O: in out Object; L: in String);

   private

      type Object is tagged record ... end record;

   end Port;

In Ada 9X, encapsulation is achieved by defining abstract data
types called private types. The type Port.Object is defined as a
private type in the visible part of the package specification
above, with its full definition given in the private part. Public
primitive operations on this private type are also declared in the
visible part of the package specification. The implementations of
these operations are given in the corresponding package body, which
we will get to in a moment.

The use of the keyword tagged in the definition of Port.Object
signals the availability of the object-oriented features of type
extension and dispatching for this type. For example, suppose we
wish to define a subclass of ports that buffer their output. We can
define this as an extension of Port.Object:

   with Port;
   package Buffered_Port is

      type Object(Size: Positive) is
        new Port.Object with private;

      procedure Flush(O: in out Object);

   private

      type Object(Size: Positive) is new Port.Object with
         record
            Last:   Natural := 0;
            Buffer: String(1..Size);
         end record;

   end Buffered_Port;


The type Buffered_Port.Object is a derived type of Port.Object
extended with the components required to implement a buffer. The
discriminant Size is used to set the maximum number of characters
stored in the buffer. A derived type inherits the primitive
operations of its parent type. In this case, Buffered_Port.Object
inherits the operations Put and Put_Line from Port.Object. An
additional operation, Flush, is defined solely on the type
Buffered_Port.Object.

Derived types are distinct types from their parent types. Thus,
given the declarations:

   P: Port.Object;
   B: Buffered_Port.Object;

the following assignment is illegal:

   P := B;  -- Type mismatch!

even though Buffered_Port.Object is derived from Port.Object. The
following explicit conversion is legal:

   -- An object of type Buffered_Port.Object can be
   -- converted to type Port.Object
   P := Port.Object(B);

but the converted value is of type Port.Object, and the extension
components in B are lost.

Ada 9X separates polymorphism from the basic tagged type construct
through the concept of class-wide types. For example, there is a
class-wide type denoted Port.Object'Class rooted in the tagged type
Port.Object. A class-wide type includes all values of all types in
the derivation class of its root tagged type. The derivation class
of a tagged type includes the type and all descendant types derived
from it.

Due to the availability of type extension, the size of a value of a
class-wide type cannot generally be determined at compile time.
Therefore, polymorphic variables in Ada 9X generally contain
pointers to class-wide types. Pointer types in Ada are known as
access types. Thus, given the following declarations:

   type Port_Pointer is access Port.Object'Class;

   type Buffered_Port_Pointer is access
     Buffered_Port.Object'Class;

   PP: Port_Pointer;
   BP: Buffered_Port_Pointer;

the following assignment is legal:

   -- Pointer to Port.Object'Class can point to
   -- Buffered_Port.Object'Class object
   PP := BP;

because the derivation class of Buffered_Port.Object is contained
in the derivation class of Port.Object.

In addition to allowing polymorphic variables, 
class-wide types also provide the mechanism for 
polymorphic dynamic binding of operations. Each 
value of a tagged type has a tag that identifies the 
dynamic type of that value. When a primitive opera- 
tion of a tagged type is passed a value of the corre- 
sponding class-wide type (which may actually be a 
value of any type derived from the root tagged type), 
the operation dispatches to the implementation iden- 
tified by the tag of the value. For example, given the 
above declarations: 

   -- Note that .all dereferences pointers

   Port.Put(PP.all, C);  -- Bound to Put implementation
                         -- in body of Port
   PP := BP;

   Port.Put(PP.all, C);  -- Bound to Put implementation
                         -- in body of Buffered_Port

The second call to Port.Put is dynamically bound to
Buffered_Port.Put, because the tag of the object pointed to by PP
after the assignment indicates type Buffered_Port.Object.
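The dispatching behavior just described can be tried out with a small stand-alone program. This is a minimal sketch under stated assumptions, not the paper's code: the two packages are nested in one procedure, the types are reduced to null records, each Put simply prints its own name, and the pointer is a general access type (`access all`) initialized with `'Access` so that no allocator is needed.

```ada
with Ada.Text_IO;

procedure Dispatch_Demo is

   package Port is
      type Object is tagged null record;
      procedure Put (O : in out Object; C : in Character);
   end Port;

   package Buffered_Port is
      type Object is new Port.Object with null record;
      --  Overrides the Put inherited from Port.Object.
      procedure Put (O : in out Object; C : in Character);
   end Buffered_Port;

   package body Port is
      procedure Put (O : in out Object; C : in Character) is
      begin
         Ada.Text_IO.Put_Line ("Port.Put: " & C);
      end Put;
   end Port;

   package body Buffered_Port is
      procedure Put (O : in out Object; C : in Character) is
      begin
         Ada.Text_IO.Put_Line ("Buffered_Port.Put: " & C);
      end Put;
   end Buffered_Port;

   --  General access type, so it can designate the declared objects.
   type Port_Pointer is access all Port.Object'Class;

   P  : aliased Port.Object;
   B  : aliased Buffered_Port.Object;
   PP : Port_Pointer := P'Access;

begin
   Port.Put (PP.all, 'x');  --  tag is Port.Object: prints "Port.Put: x"
   PP := B'Access;
   Port.Put (PP.all, 'x');  --  tag is Buffered_Port.Object:
                            --  prints "Buffered_Port.Put: x"
end Dispatch_Demo;
```

Both calls name Port.Put, yet the second one runs the overriding body because the operand is class-wide and its tag selects the implementation.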

Now let’s turn to the body of package Port. This body contains the
implementations of the two operations on type Port.Object:

   package body Port is

      procedure Put(O: in out Object; C: in Character) is ...
      end Put;

      procedure Put_Line(O: in out Object; L: in String) is
      begin
         for I in L'Range loop
            Put(Object'Class(O), L(I));  -- Redispatching call
         end loop;
      end Put_Line;

   end Port;

Note the use of the conversion Object'Class(O) in the call above to
the procedure Put. This conversion causes the call to Put to be
dynamically bound, depending on the dynamic tag of the argument O.
This is known as a redispatching call in Ada 9X, and it has the
same effect as the use of self in Smalltalk [Goldberg 83] or this
in C++ (for a virtual function) [Stroustrup 91].

The use of redispatching in the implementation of Put_Line makes
the implementation of type Port.Object self-referential. This
self-reference is very important for the implementation of the
operations of the derived type Buffered_Port.Object. The body of
package Buffered_Port must, of course, include the implementation
of the new operation Flush. In addition, the implementation of
procedure Put inherited from Port.Object must be overridden with a
new implementation that handles the buffering required for a
Buffered_Port.Object:

   package body Buffered_Port is

      procedure Put(O: in out Object; C: in Character) is
      begin
         O.Last := O.Last + 1;
         O.Buffer(O.Last) := C;
         if O.Last = O.Size then
            Flush(Object'Class(O));  -- Redispatching call
         end if;
      end Put;

      procedure Flush(O: in out Object) is
      begin
         for I in 1..O.Last loop
            Port.Put(Port.Object(O), O.Buffer(I));
               -- Statically-bound call
         end loop;
         O.Last := 0;
      end Flush;

   end Buffered_Port;

Note the conversion Port.Object(O) in the statically bound call to
the parent operation Port.Put in the implementation of Flush.

Since the procedure Put_Line is not overridden in the body of
package Buffered_Port, its implementation is inherited without
change. This is shown diagrammatically in Figure 1, where the
shading indicates that there is no implementation for Put_Line
physically included in package Buffered_Port. Figure 1 also shows
that the actual implementation for Put_Line in package Port makes a
call on Port.Put. However, when this implementation is inherited in
package Buffered_Port, the redispatching call to Port.Put, when
passed a value of type Buffered_Port.Object, will now be
dynamically bound to the overriding implementation of
Buffered_Port.Put. Thus, the characters in a line are all properly
buffered, even though the implementation of procedure Put_Line has
not changed.
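The effect of the inherited, redispatching Put_Line can be observed in a compact program. The sketch below makes several simplifying assumptions that are not in the paper: the packages are nested in a single procedure, the types are not private, and the buffer is replaced by a hypothetical Count component that records how many times the overriding Put runs.

```ada
with Ada.Text_IO;

procedure Redispatch_Demo is

   package Port is
      type Object is tagged null record;
      procedure Put      (O : in out Object; C : in Character);
      procedure Put_Line (O : in out Object; L : in String);
   end Port;

   package Buffered_Port is
      type Object is new Port.Object with record
         Count : Natural := 0;  --  stand-in for the real buffer
      end record;
      --  Overrides Port.Put; Put_Line is inherited unchanged.
      procedure Put (O : in out Object; C : in Character);
   end Buffered_Port;

   package body Port is
      procedure Put (O : in out Object; C : in Character) is
      begin
         Ada.Text_IO.Put (C);
      end Put;

      procedure Put_Line (O : in out Object; L : in String) is
      begin
         for I in L'Range loop
            Put (Object'Class (O), L (I));  --  redispatching call
         end loop;
      end Put_Line;
   end Port;

   package body Buffered_Port is
      procedure Put (O : in out Object; C : in Character) is
      begin
         O.Count := O.Count + 1;
         Port.Put (Port.Object (O), C);  --  statically bound parent call
      end Put;
   end Buffered_Port;

   B : Buffered_Port.Object;

begin
   Buffered_Port.Put_Line (B, "hello");  --  inherited operation
   Ada.Text_IO.New_Line;
   --  Count is 5: every character went through Buffered_Port.Put.
   Ada.Text_IO.Put_Line ("Count =" & Natural'Image (B.Count));
end Redispatch_Demo;
```

If the conversion to Object'Class were removed from Put_Line, the call would bind statically to Port.Put and Count would stay at zero, which is exactly the problem the next section addresses without redispatching.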

Genericity 

Consider again the hardware port example from the last section. We
wish to implement the same Port.Object private type, with the same
visible operations, but without the use of redispatching.
Nevertheless, we wish to retain the ability to redirect the binding
of self-referential calls in operations inherited by a descendant
of Port.Object. To do this we make this binding explicit using a
generic package nested within the specification of package Port:

   package Port is

      type Object is tagged private;

      procedure Put(O: in out Object; C: in Character);
      procedure Put_Line(O: in out Object; L: in String);

      generic
         type Self(<>) is new Object with private;
      package Operations is
         procedure Do_Put(O: in out Self; C: in Character);
         procedure Do_Put_Line(O: in out Self; L: in String);
      end Operations;

   private

      type Object is tagged record ... end record;

   end Port;

While the type Port.Object retains its operations Put and Put_Line,
the actual implementations of these operations are moved to the
inner generic package Port.Operations. This generic package is
parameterized by the type Self, which must be a descendant of
Port.Object (or Port.Object itself). As a descendant of
Port.Object, any actual type bound to the parameter Self will have
Put and Put_Line operations. This binding of the parameter Self
will be used in the implementation of the operations Do_Put and
Do_Put_Line to replace any self-referential redispatching calls.

The generic package Operations is nested inside Port so that its
body has visibility to the full definition of the private type
Port.Object. This allows the subprogram Do_Put to be implemented
the same way as Port.Put would have been in the last section (if we
had actually shown it!). The implementation of Do_Put_Line is also
similar to the implementation of Port.Put_Line in the last section,
but with a crucial difference:

   package body Port is

      package body Operations is

         procedure Do_Put
           (O: in out Self; C: in Character) is ...
         end Do_Put;

         procedure Do_Put_Line(O: in out Self; L: in String) is
         begin
            for I in L'Range loop
               Put(O, L(I));  -- Statically-bound call
            end loop;
         end Do_Put_Line;

      end Operations;

      package Self_Operations is
        new Operations(Port.Object);

      procedure Put(O: in out Object; C: in Character)
        renames Self_Operations.Do_Put;

      procedure Put_Line(O: in out Object; L: in String)
        renames Self_Operations.Do_Put_Line;

   end Port;

In place of the redispatching call in the implementation of
Put_Line there is now a statically-bound call in procedure
Do_Put_Line to the operation Put on type Self. Rather than using
redispatching, self-reference is achieved by instantiating the
generic package Operations in the body of package Port. This
instantiation effectively provides the fixed-point operation of
Cook and Palsberg.

The package Self_Operations is an instantiation of the generic
package Operations with type Port.Object used for the parameter
Self. The procedures Put and Put_Line are then simply renamings of
the real implementations from Self_Operations (which have the
correct argument type profiles, since Self is Port.Object for
Self_Operations). The implementation of Self_Operations.Do_Put_Line
contains a call to the operation Put for the generic type parameter
Self. Since the type parameter Self is bound to Port.Object for
this instantiation, its Put operation is simply Port.Put, which is
a renaming of Self_Operations.Do_Put. Thus Port.Put_Line
self-referentially calls Self_Operations.Do_Put, as shown in
Figure 2.

Note, however, that Port.Put_Line now makes a statically-bound call
to Port.Put. Thus this call will not be automatically redirected to
Buffered_Port.Put in the inherited operation
Buffered_Port.Put_Line. Instead, we must instantiate the generic
package Port.Operations differently for the implementation of the
Buffered_Port operations, so as to achieve the correct bindings. To
see how this is done, let’s turn next to the implementation of
Buffered_Port.Object using our new approach.

As we did with package Port, we include a nested generic package
within package Buffered_Port:

   with Port;
   package Buffered_Port is

      type Object(Size: Positive) is
        new Port.Object with private;

      procedure Flush(O: in out Object);

      generic
         type Self(<>) is new Object with private;
      package Operations is
         procedure Do_Put(O: in out Self; C: in Character);
         procedure Do_Put_Line(O: in out Self; L: in String);
         procedure Do_Flush(O: in out Self);
      end Operations;

   private

      type Object(Size: Positive) is new Port.Object with
         record
            Last:   Natural := 0;
            Buffer: String(1..Size);
         end record;

   end Buffered_Port;

Note that the inner generic package Buffered_Port.Operations
contains implementations for the inherited operations Put and
Put_Line as well as the new buffered port operation Flush:

   package body Buffered_Port is

      package body Operations is

         package Super_Operations is
           new Port.Operations(Self);

         procedure Do_Put(O: in out Self; C: in Character) is
         begin
            O.Last := O.Last + 1;
            O.Buffer(O.Last) := C;
            if O.Last = O.Size then
               Flush(O);  -- Statically-bound call
            end if;
         end Do_Put;

         procedure Do_Put_Line(O: in out Self; L: in String)
           renames Super_Operations.Do_Put_Line;

         procedure Do_Flush(O: in out Self) is
         begin
            for I in 1..O.Last loop
               Super_Operations.Do_Put(O, O.Buffer(I));
                  -- Statically-bound call
            end loop;
            O.Last := 0;
         end Do_Flush;

      end Operations;

      package Self_Operations is
        new Buffered_Port.Operations(Buffered_Port.Object);

      procedure Put(O: in out Object; C: in Character)
        renames Self_Operations.Do_Put;

      procedure Put_Line(O: in out Object; L: in String)
        renames Self_Operations.Do_Put_Line;

      procedure Flush(O: in out Object)
        renames Self_Operations.Do_Flush;

   end Buffered_Port;

Note the nested instantiation of Port.Operations within
Buffered_Port.Operations, passing along the correct binding for
Self.

As shown in Figure 2, the instantiation
Buffered_Port.Self_Operations appropriately redirects the
self-referential calls to Put and Flush to the implementations as
required. The nested instantiation of Port.Operations within
Buffered_Port.Self_Operations assures that even references to Put
in Buffered_Port.Super_Operations.Do_Put_Line now call
Buffered_Port.Self_Operations.Do_Put.
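The core of this technique can be shown in a smaller, self-contained form. The sketch below is a simplification under stated assumptions rather than the code above: a single generic procedure (the hypothetical Generic_Put_Line) replaces the nested generic packages, ordinary wrapper bodies replace the renamings, and a Count component stands in for the buffer. It relies on the Ada rule that, for a formal private extension, a call on the formal type inside an instance executes the body belonging to the actual type.

```ada
with Ada.Text_IO;

procedure Generic_Self_Demo is

   package Port is
      type Object is tagged null record;
      procedure Put      (O : in out Object; C : in Character);
      procedure Put_Line (O : in out Object; L : in String);
   end Port;

   --  Put_Line written once, parameterized by the type Self whose Put
   --  the self-referential call should reach.
   generic
      type Self (<>) is new Port.Object with private;
   procedure Generic_Put_Line (O : in out Self; L : in String);

   procedure Generic_Put_Line (O : in out Self; L : in String) is
   begin
      for I in L'Range loop
         Put (O, L (I));  --  no conversion to a class-wide type
      end loop;
   end Generic_Put_Line;

   package body Port is
      --  Binds Self to Port.Object.
      procedure Self_Put_Line is new Generic_Put_Line (Object);

      procedure Put (O : in out Object; C : in Character) is
      begin
         Ada.Text_IO.Put (C);
      end Put;

      procedure Put_Line (O : in out Object; L : in String) is
      begin
         Self_Put_Line (O, L);
      end Put_Line;
   end Port;

   package Buffered_Port is
      type Object is new Port.Object with record
         Count : Natural := 0;  --  stand-in for the real buffer
      end record;
      procedure Put      (O : in out Object; C : in Character);
      procedure Put_Line (O : in out Object; L : in String);
   end Buffered_Port;

   package body Buffered_Port is
      --  Rebinds Self, this time to Buffered_Port.Object.
      procedure Self_Put_Line is new Generic_Put_Line (Object);

      procedure Put (O : in out Object; C : in Character) is
      begin
         O.Count := O.Count + 1;
         Port.Put (Port.Object (O), C);  --  call to the parent operation
      end Put;

      procedure Put_Line (O : in out Object; L : in String) is
      begin
         Self_Put_Line (O, L);
      end Put_Line;
   end Buffered_Port;

   B : Buffered_Port.Object;

begin
   Buffered_Port.Put_Line (B, "hello");
   Ada.Text_IO.New_Line;
   --  Count shows how many characters reached Buffered_Port.Put.
   Ada.Text_IO.Put_Line ("Count =" & Natural'Image (B.Count));
end Generic_Self_Demo;
```

The design point is the same as in the text: the descendant does not inherit Put_Line as is, but supplies its own instantiation so that the binding of Self, and hence of Put, is the one it needs.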

Mixins 

The wrapper functions of Cook and Palsberg param- 
eterize both the super- and self-references of a class 
[Cook 89]. In the last section we used generics to 
parameterize the self-references. An extension of 
this approach can be used to parameterize superclass
references as well. 

To do this, we first turn the package defining the subclass
type into a generic package with the superclass type as a
generic parameter. Such a generic package provides an
independent increment of functionality that can be added on to
any appropriate superclass type. We will call such a package a
mixin, since its functionality can be “mixed into” the
superclass. The term “mixin” comes originally from the
LISP-based Flavors system [Moon 86] and is usually used in
conjunction with multiple inheritance. The mixins we will
define here are closer in spirit to the generalized concept
proposed by Bracha and Cook [Bracha 90]. (See also the Ada 9X
Rationale [Ada9X 94b] for a discussion of using generics as
mixins in Ada 9X; I have also previously described how mixins
can even be created in non-object-oriented Ada 83
[Seidewitz 92].)

For example, consider the buffered port class. 
We can turn this class into a mixin by replacing its 
superclass dependency on the port class with a 
generic parameter: 

generic
   type Element is private;
   type Super(<>) is abstract tagged private;
package Buffer_Mixin is

   type Object(Size: Positive) is abstract new Super
      with private;

   generic
      type Self(<>) is new Object with private;
      with procedure Super_Put
         (O: in out Self; E: in Element);
      with procedure Self_Flush(O: in out Self);
   package Operations is
      procedure Do_Put(O: in out Self; E: in Element);
      procedure Do_Flush(O: in out Self);
   end Operations;

private

   type Element_Array is
      array (Positive range <>) of Element;

   type Object(Size: Positive) is
      abstract new Super with
         record
            Last: Natural := 0;
            Buffer: Element_Array(1..Size);
         end record;

end Buffer_Mixin;

The type parameter Super provides the required parameterization
of the superclass type. The type Buffer_Mixin.Object is then
derived from this generic parameter. Since we needed to make
this a generic package anyway, the buffer mixin is further
generalized above by using the generic type parameter Element
(which does not need to be tagged) in place of Character.

Note that the type parameter Super is declared to be abstract.
This means that the actual type used for this parameter may be
an abstract type (though it may also be non-abstract). It is
illegal to create objects of an abstract type, though there may
be objects of non-abstract descendants of the abstract type.
Further, an abstract type may have abstract operations that
have no implementations (these are equivalent to pure virtual
functions in C++ [Stroustrup 91] or deferred routines in Eiffel
[Meyer 88]). Non-abstract descendants of an abstract type must
override all abstract operations with non-abstract
implementations. The type Buffer_Mixin.Object is also declared
to be abstract, since it may inherit abstract operations from
Super.
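The rules for abstract types just described can be seen in a small, self-contained sketch. The names Abstract_Demo, Sinks, Sink and Counting_Sink are hypothetical and chosen only for this illustration; they are not part of the port example in this paper.

```ada
with Ada.Text_IO;

procedure Abstract_Demo is

   package Sinks is
      --  Abstract type: no object of type Sink itself may be declared.
      type Sink is abstract tagged null record;

      --  Abstract operation: a specification with no implementation.
      procedure Put (O : in out Sink; C : in Character) is abstract;

      --  Non-abstract descendant: it must override Put.
      type Counting_Sink is new Sink with record
         Count : Natural := 0;
      end record;

      procedure Put (O : in out Counting_Sink; C : in Character);
   end Sinks;

   package body Sinks is
      procedure Put (O : in out Counting_Sink; C : in Character) is
      begin
         Ada.Text_IO.Put (C);
         O.Count := O.Count + 1;
      end Put;
   end Sinks;

   S : Sinks.Counting_Sink;
   --  Bad : Sinks.Sink;   --  would be illegal: Sink is abstract

begin
   Sinks.Put (S, 'a');
   Sinks.Put (S, 'b');
   Ada.Text_IO.New_Line;
   Ada.Text_IO.Put_Line
     ("characters written:" & Natural'Image (S.Count));
end Abstract_Demo;
```

Removing the overriding declaration of Put for Counting_Sink should cause the compiler to reject the type, since a non-abstract descendant would then be left with an abstract operation.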

The type parameter Super is also not constrained to be a
descendant of any known type. Therefore, within the body of the
generic package, there are no primitive operations guaranteed
to be available for Super (except for some basic operations
like equality, but that’s a technicality). Since there are no
known operations to be inherited from Super, and no other
operations are defined for it, the type Buffer_Mixin.Object
also has no known primitive operations. Instead, this type only
provides a basis for defining the implementations of the
operations given in the generic package
Buffer_Mixin.Operations.

As before, the generic package Operations is parameterized by
the derived type parameter Self. Now, however, there are no
known primitive operations to be inherited from
Buffer_Mixin.Object. Instead, the only operations on Self are
those that are explicitly given as generic parameters, in this
case Super_Put and Self_Flush. As the names indicate, the
Super_Put parameter is intended to provide the superclass Put
operation, while the Self_Flush parameter provides the
self-referential Flush operation. Thus, this generic clause
defines the complete inheritance interface for the buffer
mixin. (As will become clearer in a moment, the Super_Put
operation is defined on the type Self rather than Super to
ensure the correct binding of any self-referential calls it may
make.)

Calls to the operations given by Super_Put and Self_Flush are
now the only external calls that can be made on type Self in
the implementations of Do_Put and Do_Flush:

package body Buffer_Mixin is

   package body Operations is

      procedure Do_Put(O: in out Self; E: in Element) is
      begin
         O.Last := O.Last + 1;
         O.Buffer(O.Last) := E;
         if O.Last = O.Size then
            Self_Flush(O);  -- Statically-bound call
         end if;
      end Do_Put;

      procedure Do_Flush(O: in out Self) is
      begin
         for I in 1..O.Last loop
            Super_Put(O, O.Buffer(I));
               -- Statically-bound call
         end loop;
         O.Last := 0;
      end Do_Flush;

   end Operations;

end Buffer_Mixin;

Note that this buffer mixin does not define a 
Do_Put_Line operation. This is because a mixin 
should represent a discrete increment of functional- 
ity, and the ability to put a line is not really part of 
the buffering functionality as defined here. 

As defined in the previous sections, the port 
class does not have any superclass. However, for 
consistency, we can also turn this class into a mixin: 

generic
   type Super(<>) is abstract tagged private;
package Port_Mixin is

   type Object is abstract new Super with private;

   generic
      type Self(<>) is new Object with private;
      with procedure Self_Put
         (O: in out Self; C: in Character);
   package Operations is
      procedure Do_Put(O: in out Self; C: in Character);
      procedure Do_Put_Line(O: in out Self; L: in String);
   end Operations;

private

   type Object is abstract new Super with
      record ... end record;

end Port_Mixin;

package body Port_Mixin is

   package body Operations is

      procedure Do_Put(O: in out Self; C: in Character) is
         ... end Do_Put;

      procedure Do_Put_Line(O: in out Self; L: in String) is
      begin
         for I in L'Range loop
            Self_Put(O, L(I));  -- Statically-bound call
         end loop;
      end Do_Put_Line;

   end Operations;

end Port_Mixin;

Even though the hardware port functionality does 
not require any superclass operations, this mixin al- 
lows such functionality to be freely mixed in as part 
of any class implementation. 

Note that there is no typing relationship at all 
between the port and buffer mixins. Mixins provide 
incremental implementation completely indepen- 
dently of problem-domain typing relationships. As a 
complement to these mixins, we can define a set of 
abstract types that capture typing relationships com- 
pletely independently of implementation details. 

For example, we can use two abstract types to define the
supertype/subtype relationship between ports and buffered
ports:

package Port_Types is

   type Port is abstract tagged null record;

   procedure Put(O: in out Port; C: in Character) is
      abstract;

   procedure Put_Line(O: in out Port; L: in String) is
      abstract;

   type Buffered_Port is
      abstract new Port with null record;

   procedure Flush(O: in out Buffered_Port) is abstract;

end Port_Types;


For simplicity, this one package defines both ab- 
stract types, though they could equally well have 
been defined in separate packages. 
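As an illustration of what such abstract types offer to clients, the following sketch writes a client procedure purely against the abstract type Port. To keep the example self-contained, it repeats a reduced Port_Types (without Buffered_Port) as a nested package. Console_Port, Greet and Client_Demo are hypothetical names introduced only for this illustration, and the concrete port simply writes to standard output rather than using the mixins.

```ada
with Ada.Text_IO;

procedure Client_Demo is

   --  Reduced copy of the abstract port type, for self-containment.
   package Port_Types is
      type Port is abstract tagged null record;
      procedure Put (O : in out Port; C : in Character) is abstract;
      procedure Put_Line (O : in out Port; L : in String) is abstract;
   end Port_Types;

   --  Hypothetical concrete port used only for this illustration.
   package Console_Port is
      type Object is new Port_Types.Port with null record;
      procedure Put (O : in out Object; C : in Character);
      procedure Put_Line (O : in out Object; L : in String);
   end Console_Port;

   package body Console_Port is

      procedure Put (O : in out Object; C : in Character) is
      begin
         Ada.Text_IO.Put (C);
      end Put;

      procedure Put_Line (O : in out Object; L : in String) is
      begin
         for I in L'Range loop
            Put (O, L (I));   --  O has a specific type: static call
         end loop;
         Ada.Text_IO.New_Line;
      end Put_Line;

   end Console_Port;

   --  The client depends only on the abstract type.
   procedure Greet (P : in out Port_Types.Port'Class) is
   begin
      Port_Types.Put_Line (P, "hello");   --  dispatching call
   end Greet;

   C : Console_Port.Object;

begin
   Greet (C);
end Client_Demo;
```

Greet never names Console_Port, so any other non-abstract descendant of Port could be passed to it without change. This is the typing relationship that the abstract types capture independently of how a port is implemented.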

To actually implement the port and buffered port 
classes, we need to bring together the functionality
implemented in the port and buffer mixins with the 
type hierarchy defined by the port and buffered port 
abstract types. The following shows how this is done 
for the buffered port class: 

with Port_Types, Port_Mixin, Buffer_Mixin;
package Buffered_Port is

   type Object is
      new Port_Types.Buffered_Port with private;

private

   package Port_Implementation is
      new Port_Mixin(Port_Types.Buffered_Port);

   package Buffered_Port_Implementation is
      new Buffer_Mixin
         (Character, Port_Implementation.Object);

   type Object is
      new Buffered_Port_Implementation.Object with
         null record;

end Buffered_Port;

The instantiations of the two mixins incrementally build the
implementation of the type Buffered_Port.Object.

As shown in Figure 3, the instantiation Port_Implementation
adds port-related components to the type
Port_Types.Buffered_Port (which has no components itself),
producing the type Port_Implementation.Object (this is also an
example of why we need to allow mixin generics to be
instantiated with abstract types). The instantiation
Buffered_Port_Implementation then extends the type
Port_Implementation.Object with buffer-related components,
producing the type Buffered_Port_Implementation.Object. The
full definition of Buffered_Port.Object is a null extension of
Buffered_Port_Implementation.Object.

The partial view of Buffered_Port.Object given in the visible
part of package Buffered_Port declares this type to be a
descendant of Port_Types.Port. The full definition of
Buffered_Port.Object given in the private part of the package
is indeed a descendant of the abstract type Port_Types.Port via
the type extensions resulting from the two mixin instantiations
and the final null extension. As such, it inherits the three
abstract operations Put, Put_Line and Flush. However,
Buffered_Port.Object is not declared to be abstract and so must
provide implementations for these inherited operations.

The implementations of the Buffered_Port.Object operations are,
of course, given in the body of package Buffered_Port, using
the Operations generic packages from the port and buffer
mixins:

package body Buffered_Port is

   package Port_Operations is
      new Port_Implementation.Operations
         (Buffered_Port.Object, Put);

   package Buffered_Port_Operations is
      new Buffered_Port_Implementation.Operations
         (Buffered_Port.Object,
          Port_Operations.Do_Put,
          Flush);

   procedure Put(O: in out Object; C: in Character)
      renames Buffered_Port_Operations.Do_Put;

   procedure Put_Line(O: in out Object; L: in String)
      renames Port_Operations.Do_Put_Line;

   procedure Flush(O: in out Object)
      renames Buffered_Port_Operations.Do_Flush;

end Buffered_Port;

Since the buffer implementation is now independent of the port
implementation, both Operations generic packages must be
instantiated here. The instantiation of
Port_Implementation.Operations uses operation Buffered_Port.Put
for the self-referential Self_Put generic parameter. The
instantiation of Buffered_Port_Implementation.Operations uses
operation Buffered_Port.Flush for the self-referential
Self_Flush parameter. However, it uses the operation
Port_Operations.Do_Put, not Buffered_Port.Put, for the
superclass operation Super_Put. (This also shows why superclass
operations must be parameters of the inner generic package
Operations in a mixin.)

The actual Buffered_Port.Object operations are once again
defined as renamings of subprograms from the instantiated
Operations packages. Note, however, that Put_Line is taken from
Port_Operations, not Buffered_Port_Operations, since the buffer
mixin does not implement a Put_Line operation. Nevertheless,
the generic instantiations ensure that Buffered_Port.Put_Line
is implemented with a proper self-referential call to
Buffered_Port_Operations.Do_Put (the reader can trace how this
happens using Figure 4).
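For readers who want to experiment with the pattern, the following is a reduced, self-contained sketch of a single mixin and its composition. It is not the buffered port of this paper: Line_Mixin, Base, Impl, Console and Mixin_Demo are hypothetical names, there is no discriminant or buffer, and the mixin needs only a self-referential Put. The component Lines is reached through a view conversion to the mixin's own type, which is assumed here to be a conservative way to satisfy the visibility rules for inherited components.

```ada
with Ada.Text_IO;

procedure Mixin_Demo is

   --  Hypothetical mixin: adds line output to any tagged type.
   generic
      type Super is abstract tagged private;
   package Line_Mixin is
      type Object is abstract new Super with private;

      generic
         type Self is new Object with private;
         with procedure Self_Put (O : in out Self; C : in Character);
      package Operations is
         procedure Do_Put_Line (O : in out Self; L : in String);
      end Operations;
   private
      type Object is abstract new Super with record
         Lines : Natural := 0;   --  number of lines written so far
      end record;
   end Line_Mixin;

   package body Line_Mixin is
      package body Operations is
         procedure Do_Put_Line (O : in out Self; L : in String) is
         begin
            for I in L'Range loop
               Self_Put (O, L (I));   --  bound at instantiation
            end loop;
            --  View conversion to the mixin's type to reach Lines.
            Object (O).Lines := Object (O).Lines + 1;
         end Do_Put_Line;
      end Operations;
   end Line_Mixin;

   --  Abstract type capturing only the typing relationship.
   package Base is
      type Root is abstract tagged null record;
      procedure Put (O : in out Root; C : in Character) is abstract;
   end Base;

   package Impl is new Line_Mixin (Base.Root);

   package Console is
      type Object is new Impl.Object with null record;
      procedure Put (O : in out Object; C : in Character);
      procedure Put_Line (O : in out Object; L : in String);
   end Console;

   package body Console is
      --  Bind the self-reference to this class's own Put.
      package Ops is new Impl.Operations (Object, Put);

      procedure Put (O : in out Object; C : in Character) is
      begin
         Ada.Text_IO.Put (C);
      end Put;

      procedure Put_Line (O : in out Object; L : in String)
        renames Ops.Do_Put_Line;
   end Console;

   P : Console.Object;

begin
   Console.Put_Line (P, "mixed in");
   Ada.Text_IO.New_Line;
end Mixin_Demo;
```

As in the buffered port, the binding of Self_Put is fixed when Ops is instantiated. A descendant of Console.Object that overrides Put would therefore need its own instantiation of Impl.Operations to redirect the self-reference.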

Conclusion 

The use of generics for the static binding of self-referential
calls is at least of academic interest in the comparison of
inheritance and genericity. However, since the generic approach
can be a bit cumbersome, one may ask if it has any practical
application. In fact, there are some good reasons to consider
this approach:

1. Experience has shown that the common use of self-reference
with inheritance can make an object-oriented program difficult
to understand and change (see, for example, [Taenzer 89],
[Leijter 92], [Wild 92] and [Wild 93]). The generic approach
gives the programmer much more precise control over when and
where these self-referential bindings are made and thus makes
the use and intent of self-reference more apparent to the
maintainer.

2. In many safety-critical applications (such as avionics
software), any “dynamic” construct (dynamic memory allocation,
dynamic binding, etc.) is regarded with suspicion. This is
because such features make it much harder to verify that a
program meets stringent safety requirements. The generic
approach provides self-reference and deferred operation
implementation with fully static binding.

3. For a generic mixin, the generic clause of the inner
Operations generic package effectively gives a complete
“typing” of the inheritance interface. That is, it explicitly
lists all operations required from the superclass and all
operations called self-referentially. As described by Hauk,
such a complete typing allows the type-safe replacement of a
superclass implementation during class library maintenance
[Hauk 93] (see also [Gibbs 90] on the issues of modifying class
hierarchies). For example, in the Buffered_Port implementation
given in the last section, the use of the Port_Mixin could be
easily replaced with a different implementation of the hardware
port functionality, so long as it provided the Put operation
needed by Buffer_Mixin. Such a replacement requires no changes
to the implementation of the buffering functionality, nor any
changes to the clients of Buffered_Port. For that matter, it
would be equally easy to replace the Buffer_Mixin with a
different implementation of the buffering functionality.

Meyer was indeed correct in concluding that genericity cannot
be used to fully simulate inheritance. However, inheritance is
a much more expansive mechanism than genericity, and thus the
comparison with genericity is not entirely fair. We can
decompose the inheritance mechanism as type extension plus
polymorphic typing plus self-reference. Genericity is only
comparable to the parameterization-oriented effect of
self-reference in the inheritance mechanism. As we have seen in
this paper, inheritance actually can be simulated by type
extension plus polymorphic typing plus genericity, and the
generic approach actually has some potential advantages.